Foreword


PUBLIC EXAMINATIONS in India play a pivotal role in determining
the functional content and method of education. They
carry a good deal of prestige and act as a passport for admission
to higher courses as also for entering jobs. The agencies conducting
public examinations are quite conscious of their responsibility and in
their own way take various steps to ensure that they are run as efficiently
as possible. Adequate precautions are taken in the appointment of
paper-setters, in instructions to examiners, etc., in order to minimise
the factor of unreliability.

Nevertheless, studies conducted both in India and outside have brought
out very clearly the limited value of such examinations. Dr. A. E.
Harper’s study is one such. One major difficulty in conducting such
studies in India has been the faith placed in external examinations and
the element of secrecy attached to them. Dr. Harper has succeeded in
overcoming this difficulty and his research is based on live data obtained
in simulated conditions.

He has studied a sample of answer scripts in four high school subjects
by getting them examined by more than one examiner. This
massive study has brought out the numerous deficiencies inherent in
external examinations. His findings will come as an eye-opener both to
those functioning at various levels in examining agencies and to
policy-makers in the field of school education. They will also be useful to
decision makers and will bring home to them the fact that the conventional
steps taken by the examining agencies are not adequate to overcome
the evils of the system. It is hoped that Dr. Harper’s findings
will open up new vistas for reforms in the present system of external
examinations.


New Delhi 
April 1975 


Rais Ahmed 
Director 

National Council of Educational Research and Training



To The Reader 


THIS REPORT is addressed to all who are interested in our
present examination system, and in its reform.
Some of you are highly trained experts, who can talk of
sigmas, and standard errors of measurement, as easily as about the price
of mangoes.

Others of you are teachers, administrators, educators, students,
parents. You know nothing about statistics, and even the coefficient of
correlation is an enigma to you. It is to you that this chapter is especially
addressed.

You will find this report written on three levels. This means that
there is some repetition. But it also means that it will be readable and
useful to a larger number than might otherwise have been possible.
First, for the impatient, is a chapter of findings alone (Chapter II).
This is just a series of statements, without presenting any evidence. This
is the chapter labelled “Briefly Summarized”. You may also find this useful
for quotation.

Second, those who are interested in the major findings, but not
in the detailed procedures and statistical analyses, will find the
“Highlights” (Chapters III and IV) most useful. These are non-technical and
can be read, and understood, by any teacher, parent, or other interested
person. In addition to this, we suggest that you at least glance at the
Tables and Figures in the rest of the book. These will give you a good
overview of what we have found.

Third, the rest of the book is for those who want all the details.
Part Two, Detailed Reports, begins with Chapter V, which reviews other
research in this field, particularly that done in India. Because this has
never been reviewed before, we believe the reader will be surprised to
find how much has already been done. Chapter VI describes the experimental
design, and Chapter VII, which attempts to clarify terminology
and concepts, will be of interest to those with a serious concern in this
field. Detailed reports follow on history, Hindi, biology and mathematics
examinations. Mathematics is placed last because it is popularly considered
the most reliable of these four examinations. The reader who seriously
expects this is in for a surprise. Finally, there are three chapters of
recommendations, with an emphasis on the practical, “it can be done
now” type.

Part Two is, of course, primarily for those with a knowledge of
psychometric statistics. But others will find it useful also. Try it, even
if you are not an “expert”. Every technical term has been explained
briefly, when it is introduced. And there are, of course, many parts which
require no “technical” knowledge whatsoever.

In Part Two, we are sure that you will want to read at least the 
section related to the subject you teach. Do not be worried if there are 
parts you do not understand— you will certainly find much of interest 
in the parts that you can understand quite clearly. 

There is one more way to review this book, if you do not have time
to read it all: look at the Index. Just read through the Index, find topics
that interest you, and look them up. This way you can go through the
book quite quickly, and yet not miss anything which is likely to be of
importance to you. It would be best, however, if you read Chapters I
and II first, before using the Index to skim the rest.

Finally, here are some of the questions to which you can seek answers 
in this book: 

“How reliable are our traditional type of examinations?”

“What happens when 90 experienced examiners mark the same
answer books?”

“If an examiner re-examines a batch of answer books, will he give
them the same marks the second time?”

“How meaningful is a First Class?”

“Do experienced examiners agree on pass percentage for the same
answer books?”

“What element does ‘chance’ play in passing, failing, and Divisions?”

“Will it help to have several examiners mark each answer book,
and average the results? Is there a cheaper way than this to make
traditional-type examinations more reliable?”

“What improvements can be introduced without major changes in
the present type and pattern of examinations?”
Introduction


THIS BOOK reports on the reliability of examinations in India.
Under two different experimental conditions, 4010 answer books
in four different High School subjects were marked again by
130 examiners.

Why do such studies? 

There has been a tremendous amount of research on examinations
in Europe and America. Why waste time and money repeating this research
in India? Specifically, why these particular studies?

Unfortunately, very few modern foreign studies apply directly to
the Indian situation. This is because most research-minded examination
systems have long since given up essay-type examinations except for
very specialised uses. The research on which they based their decisions,
most of it done decades ago, is no longer widely available. Thus research
on wide-scale examinations, involving many examiners, seems still to be
of value.

Also, somehow, research done elsewhere is never really convincing
in a field like this. There is a deep conviction among examiners that
“we do things differently. Other examiners (especially examiners in other
countries) may be unreliable, but my marking is always accurate.” Even
experts in educational measurement, fully aware of all the research
results, may still suffer from this conviction that they, themselves, are
somehow exempt from this charge of examiner unreliability.

More basically, however, there is some reason to believe that the
marking of answer books should be more reliable in India than abroad.
factors should be considered in marking, nor as to the general
achievement expected. In India, on the other hand, the educational
system within a particular state is uniform, with a fairly rigid curriculum.
A state-wide examination system is built on the assumption that experienced
examiners do have common standards. There is, therefore, some
justification for traditionalists who reject Western examination studies
saying, “These results do not apply to India; we do things differently
here.” Because of this, Indian studies are needed in India. Dr. H. J.
Taylor, of Gauhati University, seems to be the only scholar who has
responded in recent years to this vital need to study the marking of
examination books under Indian conditions. (Dr. Gayen and others have
studied marks, not marking.) Since Dr. Taylor’s approach is somewhat
different from ours, it is hoped that the two sets of studies can be
considered complementary to each other.

One of the two studies, in which 90 experienced examiners re-marked 
the same ten history answer books, has a special justification. One 
criticism of the only western studies of this type of which we were aware 
when we began this research, is that they examined the re-marking of 
only a single answer script, in each subject. Everyone “knows” that 
examiners differ in standard. Might it be, however, that a group of 
examiners could at least agree on which of a pair of answer books is 
better — or, more generally, agree in placing a group of answer books in 
the same order of merit? If there is such agreement, then the disagreement 
on actual marks is irrelevant, as these can be adjusted by “scaling”. Thus 
it is important to know if agreement on order of merit exists. 

This study shows that unfortunately even scaling would not change 
the fact that traditional essay-type examinations are unreliable. This 
unreliability has been proven again and again. Yet we continue to be 
convinced that each one of us, personally at least, can mark an answer 
book accurately. Why? 

It is interesting to speculate about the sources of this conviction of 
personal perfection in examination marking. Is it because we are unable 
to face the consequences of not believing that our marks are reliable? Is 
it because, after the tremendous investment of time and effort in marking 
a set of answer books, we must believe that we have marked them 
correctly because to admit that all that time and effort have really been 
wasted is almost intolerable? Perhaps in most areas in which personal
judgements are demanded, especially those judgements which affect the
lives of others, we are almost forced psychologically to a conviction of
infallibility. This conviction is the only defence against an overwhelming
sense of personal responsibility, if we admit that our judgements may be
seriously wrong. History is filled with instances of wrong judgements by
persons in high authority. How many have ever publicly admitted that
they were in error?

Ultimately, perhaps every skeptical examiner may need to be forced
to re-mark his own papers before he will really believe that he himself
is unreliable. Fortunately, many who are less self-important will realize
that the results shown in this book apply equally to their own personal
work. This is the justification for the present studies.
Briefly Summarized


THIS BOOK reports two different studies. Both are based on
the actual answer books written by Class X students in Higher
Secondary Board Examinations in India. Both involve experienced
examiners who have worked for several years under the same
Higher Secondary Board.

One is the “ninety marking ten” experiment. Ninety photographically
exact copies were made of each of ten history answer books. So
each of the ninety experienced examiners marked the same ten answer
books. They marked them as carefully as possible.

The other study is called the “four thousand re-examined” study.
Four thousand answer books were selected: one thousand each in
history, Hindi, biology, and mathematics (geometry). Forty examiners
(ten in each subject) were asked to mark them a second time. Each
examiner marked 50 answer books that he himself had previously
marked, and 50 that had been marked previously by another examiner.
The answer books were mixed up, so the examiner did not know which
were his and which were another’s.

Now for some of the findings.

What happens when ninety experienced examiners mark the same ten
history answer books?

** One of the answer books was considered the best of the ten by
one experienced examiner, and the worst of the ten by another.

** The only answer book that was considered worthy of a
Distinction by one experienced examiner was failed by seven. Eight gave
** Probably 4,000 out of every lakh of candidates are awarded
marks 15% or more above or below what they should have
received, i.e. they are one full Division lower or higher than the marks
they would have received if their answer books were marked without error.

What happens when forty experienced examiners mark four thousand
history, Hindi, biology and mathematics (geometry) answer books a second
time?

** If four thousand answer books are marked a second time, about
52 of them will be raised or lowered by two Divisions. For example,
Firsts will receive a Third, and Failures will be awarded a Second. Even
in mathematics, 11 out of one thousand will be changed by two Divisions.
If these candidates are administered a second examination, the number of
two-Division changes will be much larger.

** On the average, examiners who mark the same candidates twice
will give the same Division or Class both times to only about 65% of the
answer books. For thirteen out of one thousand, the two marks will
differ by two Divisions.

** Mathematics examinations are only slightly more reliable than
history examinations. The agreement on Divisions is 77% for geometry
and 60% for history. Objective-type examinations, on the other hand,
give an agreement above 90% for two separate examinations and, of
course, 100% for two examiners of the same answer books.

** Examiners are inconsistent in pass percentage, the number of 
candidates awarded each Class or Division, and even the marks awarded 
individual candidates. They not only disagree with each other, but they 
even disagree with themselves when re-doing their own work. 

** Out of 229 First Divisioners, 83 were awarded a Second or Third
by another examiner. Twenty-two (mostly in geometry) were raised to a
Distinction.

** Twenty-seven per cent of all candidates failed by one examiner
were passed by another examiner.

** The average difference between examiners is nearly six per cent
of marks. Some pairs of examiners, even in geometry, differ by as
many as 20 marks out of 50.

** Examiners changed the pass percentage by as much as 58%
when they re-marked a batch of answer books. Even a single examiner’s
pass percentage varied as much as 58% from one marking to the next.

** When an examiner marks answer books a second time (and he
marks some that he himself had marked first, and some that were first
marked by another examiner) he will agree with his own first marking
only slightly better than he will agree with another’s marking. Examiners
in history, Hindi, biology and geometry all agreed with each other only
Highlights of “Ninety Marking Ten”


THIS STUDY set out to answer the question, “What happens
when one hundred experienced examiners are asked to mark the
same ten answer books?”

Ten history answer books were selected. The original examiners’
marks of these ten answer books spread from 7 to 28 marks out of 50.
Forty per cent had passed and sixty per cent had failed, according to the 
original examiners. 

We especially selected answer books in which the candidates had 
answered the same five (out of ten) questions. This was in order to 
reduce the variability due to different choices of questions. This makes 
the reliability of the examination in this study higher than the regular public 
examination in history. (The reliability when different candidates choose 
different questions is reported in the “four thousand re-marked" study.) 

The instructions issued to the examiners were the same ones which
had been issued for the regular examination. It was not felt necessary 
to send any “model answer books", as all these examiners had already 
marked the same examination several months earlier. 

Ten examiners did not return their marks sheets, so this study is 
based on the remaining ninety. All ninety examiners were regular 
experienced examiners of the same Board of Secondary Education 
in India. They marked the examination books as carefully as possible. 

1. What happens when ninety experienced examiners mark the same ten 
history answer books ? 

Table 3.1 shows the marks awarded to each of the ten answer books. 
These are marks out of 50. 
TABLE 3.1

Frequency Distributions of Marks Awarded by 90 Experienced
Examiners to Each of 10 History Answer Books
For simplicity, the ten answer books (A to J) are arranged in order
of [...]. The first column shows the marks awarded by 90
experienced examiners to candidate D. One examiner awarded him 35
marks. Five examiners awarded him 34 marks. At the bottom of the
column, we find that one examiner awarded him 17 marks out of 50, [...]
awarded him 18, etc. At the very bottom we see that the average mark
awarded by the 90 experienced examiners to answer book D was 27 marks
out of 50. And the range of marks, from the lowest to the highest, is [...].
The lines across represent the Classes or Divisions. Those below the
first line have failing marks. Those between the first and second lines
are in the Third Division. The top line, above 38 marks, is a Distinction.
Look at the second column, for answer book E. This candidate
was awarded 38 marks out of 50 by one examiner, and only 11 marks
out of 50 by another. The only one of the ten candidates who was thought
worthy of a Distinction by one examiner was failed by seven.

The only candidate who was failed by all the examiners was Candidate J,
and even he was awarded a spread of marks from 2 to 15.

The range of marks is interesting. There was such wide disagreement
that some answer books received marks over a range of 28 out of 50.
Even where there was the maximum degree of agreement among
examiners (candidate B) the range was still 13 marks out of 50.

This range is somewhat related to average marks. Those with the
highest averages were awarded the largest range of marks. However,
look at candidates B and H. Their averages were both 11 marks out of
50, but their ranges were 13 and 17. There are some qualities on the
importance of which examiners agree, and some on which they disagree.
When an answer book contains mainly the former qualities, the range
of marks will be narrower.

What does all this mean? Teachers and students are rightly
concerned with the maximum error possible in the present system of
examinations. Answer book E shows this. (If we had had 100 answer
books instead of only ten, the maximum error would probably have been
even larger.) Obviously, the correct mark for this candidate is 23, the
average of the 90 experienced examiners. Suppose that in a large public
examination there are, say, 1,500 candidates who, like E, “should”
receive 23 marks out of 50. Out of these, as many as 17 may receive
a Distinction, while about 117 of them will fail, even though they are
all actually of equal merit.
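
The figures of 17 and 117 appear to follow from simple proportion, on the assumption that the verdicts of the 90 examiners on answer book E are representative of examiners at large. A minimal sketch of that arithmetic (the function name is ours, introduced only for illustration):

```python
def expected_count(candidates, examiners_with_verdict, total_examiners=90):
    """Scale an observed share of examiners up to a group of candidates."""
    return candidates * examiners_with_verdict / total_examiners

# Answer book E: 1 of the 90 examiners awarded a Distinction, 7 of 90 failed it.
distinctions = expected_count(1500, 1)
failures = expected_count(1500, 7)

print(round(distinctions), round(failures))
```
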

2. Are these differences just random errors?

There is the possibility that examiners made just one
mistake out of ten. It is possible that they agreed on almost all candidates,
but just happened to make a mistake on one, and that if this
“mistake” was for a different candidate for each examiner this could
produce the wide variations in Table 3.1.

Is this what happened? [...]

Here [...] and 69) who, though they awarded almost [...]
marks, differed widely on many of the candidates.

[Table: marks awarded to each candidate by the first and the second
examiner, with the difference between them. Entries that survive in the
source: first examiner 38, 29, 26, 16, 12, 10; second examiner 20, 20,
15, 20, 15, 19, 18, 13, 13, 13 (mean 16.9); differences 18, 6 (mean 7.9).]

[...] The answer to this question is already obvious from Table 3.1,
but it is sharpened up in Table 3[...]
of 50) that he was failed by all examiners.

The highest number of examiners awarding the same Division to a
candidate was 50 out of 90, which is only a little over half. Statements
such as, “He is a First Divisioner” or “He is a Failed student”
do not carry a very accurate meaning. As we shall also see in the next
study, Class or Division depends nearly as much on pure chance (that
is, on which examiner marks the candidate’s answer book) as it depends
on the candidate’s own actual merit.


TABLE 3.3

Variation in Pass Percentages when 90 Experienced Examiners
Mark the Same 10 History Answer Books

 1 examiner  passed 8 candidates (80%)
 4 examiners passed 7 candidates (70%)
14 examiners passed 6 candidates (60%)
16 examiners passed 5 candidates (50%)
15 examiners passed 4 candidates (40%)
23 examiners passed 3 candidates (30%)
13 examiners passed 2 candidates (20%)
 4 examiners passed 1 candidate  (10%)


4. Can we trust “Pass Percentage”?

We often evaluate teachers, and educational institutions, in terms
of their pass percentages. Is this fair?

Remember that under the present system, the history answer books
of one higher secondary school will go to one examiner, while those of
another will go to a different examiner. (The only exception is Assam
where, in some examinations, the answer books are mixed up together
before sending out.) Under these circumstances, what does the “pass
percentage” really tell us about a teacher or institution?

Table 3.3 shows how widely experienced examiners will differ on the
pass percentage of the same batch of answer books. For one examiner
the pass percentage was 80%, while for another four it was only 10%.
For 19 examiners it was 60% or above, while for almost the same number
(17) it was 20% or below. Obviously, pass percentage tells us more about the
examiners than it does about the candidates or their class teachers.
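
The counts quoted from Table 3.3 can be checked by a short tabulation. The sketch below simply re-enters the table; the variable names are ours. The average pass percentage over the 90 examiners is not stated in the text, but it follows directly from the table.

```python
# Table 3.3: candidates passed (out of 10) -> number of examiners
passed_to_examiners = {8: 1, 7: 4, 6: 14, 5: 16, 4: 15, 3: 23, 2: 13, 1: 4}

total = sum(passed_to_examiners.values())

# Examiners whose pass percentage was 60% or more, and 20% or less.
high = sum(n for passed, n in passed_to_examiners.items() if passed >= 6)
low = sum(n for passed, n in passed_to_examiners.items() if passed <= 2)

# Average pass percentage over all examiners (each candidate counts for 10%).
mean_pass_pct = sum(passed * 10 * n for passed, n in passed_to_examiners.items()) / total

print(total, high, low, round(mean_pass_pct, 1))
```

The average comes out close to the 40% pass rate given by the original examiners, so it is the spread from one examiner to another, not the overall level, that makes a single pass percentage untrustworthy.
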

The variation in pass percentage from year to year in the public
examinations, and the variation [...] teacher to teacher, just tells us
that [...]. Taylor’s (1962e, 1963a) studies have shown similar results.
Readers may know that Dr. Taylor, who was [...] of Gauhati
University, was formerly [...] (not of psychology or
education). He approached the problem of examinations as measurements
in the same rigorous way [...] approach the problem of
the reliability of any measurement in physics. His conclusion about the
pass percentage (1963, page 23) was:

“Whether they encounter [...] examiners
is a matter of chance, and [...] depends. This overwhelming
element of chance, the extent of which has not previously
[...] for the failure [...] and university [...]
“randomization and scaling” [...] was one quarter. This clearly [...]
5. Can examiners agree on who is best?

Of course “everybody knows” that examiners do differ in standard.
[...] Even though [...] same instructions [...]
differed quite widely [...] at least agree on the [...]

[...] Zero, and award the rest in [...] methods which can be [...]
more detailed discussion [...]

What we did was this. We looked at each examiner’s marks
separately. The highest mark awarded by the first examiner was given
a rank of “1”. The next highest mark awarded by the same examiner was
ranked “2”, and so on down to [...]. We did the
same for the second examiner, and the third, and so on through the ninety.

Table 3.4 shows the results. Answer book D was considered the best
(rank 1) by [...] of the 90 [...] considered it
the second-best of the ten [...] and one considered [...]. At
the bottom, you can see the average rank (or order of merit) of [...]
candidate D’s marks was [...]

For convenience, the answer books in Table 3.4 are in the same [...]
highest agreement on ranks is for the best candidates. Is it more
important to be sure that you fail all the weak students, or is it more
important to be sure that the best students are awarded the marks
they [...]? This is, of course, a question of educational philosophy that
is beyond the scope of this [...] more depending on the [...]
at the top of the [...]
average ability. If so, then [...] these ranks [...]
request from examiners. The [...] procedures (see chapter on
Scaling). Under such a system, [...]
rank from 23 to 30, and the average
mark would have been 29 (instead of 27).
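
The ranking procedure used in this section (rank each examiner's marks separately, then compare ranks across examiners) can be sketched as below. This is only an illustration: the marks are hypothetical, not the study's data, and the handling of tied marks (sharing the average of the ranks they span) is our assumption, since the report does not say how ties were treated.

```python
def ranks_for_examiner(marks):
    """Rank one examiner's marks: the highest mark gets rank 1.

    Tied marks share the average of the ranks they span (an assumption;
    the report does not say how ties were handled).
    """
    ordered = sorted(marks.items(), key=lambda item: -item[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every answer book with the same mark.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2
        for book, _ in ordered[i:j + 1]:
            ranks[book] = shared
        i = j + 1
    return ranks


def average_ranks(all_marks):
    """Average each answer book's rank over all examiners."""
    per_examiner = [ranks_for_examiner(m) for m in all_marks]
    books = per_examiner[0].keys()
    return {b: sum(r[b] for r in per_examiner) / len(per_examiner) for b in books}


# Hypothetical marks from three examiners for four answer books.
marks = [
    {"A": 22, "B": 15, "C": 27, "D": 34},
    {"A": 30, "B": 12, "C": 20, "D": 28},
    {"A": 18, "B": 18, "C": 25, "D": 31},
]
avg = average_ranks(marks)
print(avg)
```

If examiners agreed on order of merit, every examiner would produce the same ranks and each average rank would be a whole number; the further the average ranks drift from that, the less scaling alone can repair.
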


We have already quoted [...] But [...] more specifically, [...]
crucial decision for an [...]
mean failure on the examination. [...] assume that the average
mark [...] candidate [...] candidate C, who [...] Out of
90 examiners, 24 (or [...]) would have failed this candidate [...]
only examiners [...] examiner [...] he failed only because by [...]
one chance in four [...]
not time that we begin to use it?

7. How reliable is this examination?

In the layman’s sense of the term, we have been discussing reliability
(and unreliability) all along. But many readers may be used to the more
technical use of the term “reliability” as it relates to tests and
examinations. Particularly, they are interested in two specific measures
of reliability, the coefficient of reliability and the standard error of
measurement. (The non-statistical reader will find a layman’s explanation
of the standard error of measurement in the next chapter, where it is
more crucial to the discussion.)

We must distinguish between the reliability with which an examiner
marks an examination, and the reliability of the examination itself.
(See Chapter VII.)

The average “examiner reliability” or “reliability of marking” of
this history examination was .83. In comparison with this, the “reliability
of marking” of an objective examination is always 1.00. But the inability
of two examiners to agree on the marks they assign is not the only source
of unreliability in an examination. There are also factors related to the
examination questions themselves. Thus the reliability of the actual
marks assigned to students is due to a combination of these factors. We
have called this reliability (which is comparable to the usual use of the
term “reliability” in objective examinations) “marks reliability”. (See
Chapter VII for fuller discussion.)

The marks reliability of this examination is .71. The technically
trained reader will recognize that this is quite inadequate for the purposes
for which examinations are given, i.e. to distinguish reliably between
individuals.

This is underscored by the standard error of measurement, which
is 3.43. This means that 5 out of every 100 candidates will be awarded
marks which are either about 7 marks above or 7 marks below what they
would have received on a perfectly reliable examination.
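
The “about 7 marks” is roughly twice the standard error of measurement. A minimal sketch of the reasoning, assuming that errors of measurement are normally distributed (an assumption the text does not state explicitly), so that 5 candidates in 100 fall outside about 1.96 standard errors:

```python
from statistics import NormalDist

sem = 3.43  # standard error of measurement reported in the text (marks out of 50)

# Two-sided 5% point of the normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Error exceeded, in one direction or the other, by 5 candidates in 100.
band = z * sem
print(round(band, 1))
```
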

Unfortunately, all of these reliability figures are misleading,
because they are based on the assumption that all examiners are using
the same “scale” in marking their examinations (in the layman’s terms,
that they are all marking to the same standard). But they are not. For
example, suppose a psychological test has a coefficient of reliability of
.83 and a standard deviation of 5.6 (the average of Examiners 15 and 6).
What would we conclude? In these circumstances the standard error of
the difference, which tells us the difference between two sets of marks,
is 3.26 marks out of 50. Thus the average difference between two
examiners should be 2.1 marks and the maximum about 8 marks out of 50.

Now let us look at an actual pair of examiners whose examiner
reliability is .83.


Candidate      Marks by Examiner      Difference
                  15        6

    A              7       22             16
    B              5       15             10
    C              8       27             19
    D             20       34             14
    E             11       29             18
    F              6       26             20
    G              2       16             14
    H              7       17             10
    I             10       22             12
    J              5       13              8

    Average difference                    14.1


Notice that Examiner 15 passed only one candidate while Examiner 6
passed seven. The average difference between the two is 14 marks out
of 50 (not 2.1) and the largest difference is 20 (not 8). This is the penalty
for not using modern methods of scaling.
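
The contrast between what the reliability figure predicts and what this pair of examiners actually did can be worked through as follows. One standard formula for the standard error of a difference, for two sets of marks with equal spread, is SD × √(2(1 − r)); with the figures quoted it gives about 3.26 to 3.27, which matches the text to within rounding, though we cannot be sure it is the exact formula the study used. The observed differences are copied from the Difference column for Examiners 15 and 6.

```python
from math import sqrt

reliability = 0.83  # examiner reliability quoted in the text
sd = 5.6            # average standard deviation of Examiners 15 and 6

# Standard error of the difference between two sets of marks
# with equal spread and the given reliability (assumed formula).
se_diff = sd * sqrt(2 * (1 - reliability))

# Observed differences between Examiners 15 and 6 for candidates A to J.
differences = [16, 10, 19, 14, 18, 20, 14, 10, 12, 8]
mean_diff = sum(differences) / len(differences)
largest = max(differences)

print(se_diff, mean_diff, largest)
```

The observed average difference is several times the predicted standard error, which is what one would expect when the two examiners mark to different standards rather than merely disagreeing at random.
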

It must be remembered also that these figures (i.e. the examiner
reliability of .83 and marks reliability of .71) were obtained for a set
of answer books in which all candidates had answered the same five
questions. Had the usual situation prevailed (different candidates
answering different combinations of five out of nine questions) the
correlations between examiners would have been lower. The highest
marks reliability that we found for history where a choice of questions
was allowed was .43, as against .71 when all candidates had answered the
same five questions. (See Figure 4.5.)

8. Will it help to have several examiners mark each answer book and
average the results?

Multiple marking of answer books is the only way to reduce the
unreliability of marking to the minimum. And this also increases total
reliability, as examiner unreliability is one of the factors in the total
situation.
To study what would happen, we assigned examiners “at random” to
candidates. For example, for Candidate A we pulled out the mark-sheets
of 10 of the 90 examiners. We wrote down, in the order in which the
mark-sheets had been drawn, the marks assigned by each to candidate A.
Then we put these back, and again drew another 10 examiners for
candidate B. Thus we obtained a series of ten marks for each candidate.

Then we averaged the first two marks, then the first three marks,
then the first four marks, etc., for each candidate. We compared these
averages with the “correct” mark based on the average of all
90 examiners.

How many examiners do we need? This depends partly on how
accurate we want to be. Suppose we say that the “error of marking”
should not be more than 2 marks (out of 50), for at least two thirds of
the candidates. Our results indicate that each answer book would have
to be marked by five different examiners to reach even this modest
standard of accuracy.
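
The drawing-and-averaging procedure described above can be sketched as follows. The function names are ours, and the 90 marks are randomly generated stand-ins, not the study's data; the sketch shows only the mechanics of drawing ten examiners, forming running averages, and measuring each average against the mean of all 90.

```python
import random


def running_averages(marks):
    """Average of the first 1, 2, 3, ... marks, in the order drawn."""
    averages, total = [], 0
    for count, mark in enumerate(marks, start=1):
        total += mark
        averages.append(total / count)
    return averages


def marking_errors(all_marks, draws=10, rng=random):
    """Draw `draws` examiners at random and compare the running averages
    with the 'correct' mark, taken as the mean of all examiners."""
    correct = sum(all_marks) / len(all_marks)
    drawn = rng.sample(all_marks, draws)
    return [abs(avg - correct) for avg in running_averages(drawn)]


# Hypothetical marks from 90 examiners for one answer book (not the study's data).
rng = random.Random(0)
hypothetical = [rng.randint(11, 38) for _ in range(90)]
errors = marking_errors(hypothetical, draws=10, rng=rng)
print(errors)
```

Repeating the draw for every candidate and counting how often the error stays within 2 marks for each number of examiners would give the kind of result reported above.
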
Highlights of

“Four Thousand Re-examined”


FOUR THOUSAND Class X final examination answer books were
marked a second time.

One thousand of these came from each of the following subjects:
history, Hindi, biology, mathematics.

Half of the answer books were marked a second time by the same
examiner; the other half of them were marked the second time by a
different examiner.


In each subject all examiners had worked for the same Board and
in fact for the same Deputy Head Examiner.

The instructions issued to the examiners were the same ones which
had been issued for the regular examination. It was not felt necessary
to send any “model answer books”, as all these examiners had already
marked the same examination several months earlier.


1. Differences between examiners

The differences between the two markings of each of 4,000 answer
books ranged from 0 to 20 marks out of 50. Surprisingly, it was a
mathematics (geometry paper) examiner who differed from his own first
marking by 20 points. Examiners re-examining their own answer books
did only slightly better than examiners re-examining some other examiner’s
answer books. The average differences between pairs of examiners of the
same answer books were as follows:

History        3.0 out of 50 marks
Hindi          2.8 out of 50 marks
Biology        2.8 out of 50 marks
Mathematics    2.6 out of 50 marks

It is interesting to note that mathematics, an “exact” subject, shows
almost as much average difference between examiners as biology, which
depends more on judgement. As will be pointed out later, this is due to
differences in the scales used for these two subjects.

These average differences may look small. But one does not expect
to be able to climb Mount Everest just because the average height of the
Himalayas may be only 10,000 feet. In other words, each of these averages
represents a wide range of differences between examiners. (See next
section.) And the “difference” for any particular candidate can be
disastrously large.

This is shown by the fact that examiners frequently fail to award
even the same class or division to a candidate in both markings.
Sometimes a candidate who was awarded a Second Class by one examiner
was failed by the other; sometimes a Third Divisioner was raised to a
First. (See Table 4.3.) Overall, the following percentages of candidates
were awarded different Classes or Divisions when they were marked a
second time:

History       40.1%
Hindi         32.0%
Biology       38.5%
Mathematics   22.9%


2. How large a mistake is likely to be made for any single candidate?

Table 4.1 shows how much examiners differ from each other when
marking the same candidate. It is a frequency tabulation of the number
of differences of each size appearing when (a) the same examiner
re-examined his own answer books, and (b) the answer books were
re-examined by a second examiner. In each case, the frequency distribution is
based on ten pairs of 50 answer books each.

For example, look at the history columns. When pairs of examiners
(the D column means “different”) marked the same answer books, in
one case they differed by 16 marks out of 50. In two cases they differed
by 12 marks. They agreed completely (0 marks difference) in only 46
cases out of 500.
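
A frequency tabulation of this kind is straightforward to compute. The sketch below is a minimal illustration, not the study's own procedure: the function name and the six sample marks are hypothetical, chosen only to show how differences between two markings are counted by size.

```python
from collections import Counter


def difference_distribution(first_marks, second_marks):
    """Count how often each absolute difference occurs between two markings."""
    if len(first_marks) != len(second_marks):
        raise ValueError("both markings must cover the same answer books")
    diffs = [abs(a - b) for a, b in zip(first_marks, second_marks)]
    # Sort by size of difference so the result reads like a frequency table.
    return dict(sorted(Counter(diffs).items()))


# Hypothetical marks (out of 50) for six answer books; not data from the study.
first = [17, 23, 30, 12, 41, 25]
second = [17, 20, 34, 12, 38, 26]

table = difference_distribution(first, second)
mean_difference = sum(abs(a - b) for a, b in zip(first, second)) / len(first)
```

Here `table` maps each size of difference to the number of answer books showing it, and `mean_difference` is the kind of average quoted in Section 1. The average hides exactly the large individual differences that the table exposes.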


Now look at the mathematics (geometry) column. Discouraging,
isn't it, to discover that even in mathematics an examiner can disagree
with himself by as many as 20 marks out of 50, i.e. 40% of marks! Granted
that this is only one case out of 500. But the fact that this is not just a
single random, or freak, error is shown by the fact that there are several
[...] and 16. This means that if
there are 10,000 candidates, we can expect that as many as 75 to 100 of
them can be victims of errors close to 40% of marks.

Table 4.1 speaks for itself, and the experienced teacher (or examiner)
will be able to extract a great deal of interesting information from a
detailed study of it. And Table 3.1, with many more examiners involved,
showed that even larger errors (up to 27 marks out of 50) are
occasionally possible.

Fortunately, the very extreme variations are not very frequent. The
average differences (as pointed out earlier) are not too large. Still, “Many
persons are drowned in rivers whose average depth is only two feet.”
It seems to us that we are rightly much more concerned with the extreme
cases than with the averages. If even one per cent of higher secondary
school examinees can have their entire future ruined by these large random
errors, can we justify this by saying that the “average” error is not more
than 5 or 6 marks? Such large random errors just do not occur when
modern, scientifically based examination methods are used.


3. Can we trust “Pass Percentage”?


A great deal of weight is placed on the “pass percentage”. Teachers
are condemned if the pass percentages in their classes fall, praised if they
rise. Institutions are compared by their pass percentages, and nothing
else. This is, to say the least, incredibly naive. There should be at least
some attempt to take account of the quality of the students admitted,
before deciding that the pass percentage tells you how good a teacher
or institution is. It is well known, in fact, that some institutions raise
their pass percentages just by refusing to admit any student who has
any possibility at all of failing.

Aside from this, is the “pass percentage” actually a reliable figure?
It would be reliable only if it could be shown that it is a constant standard,
i.e. that examiners agree, with themselves and with others, on the pass
percentage.


[We saw] in the last chapter that ninety experienced history examiners
[...] pass percentage for the same ten answer books. But
[...] study, where examiners are paired with each other,
[...] previous records? And what about subjects other
[...] There is no reason, here, to discount the results of the
previous study.

[...] Table 4.2 shows the difference
in the [...] answer books passed in the first and second
marking. [In] Hindi [...] there was one difference of 58% between
[...] second time by the same (S) examiner.
The first time he marked [the batch he passed 66% of the answer books];
the second time he marked the same batch of 50 answer books,
he passed only 8% of them. This difference (66% - 8% = 58%) is the
difference tabulated in Table 4.2.

The highest difference for Hindi when the second marking was by
a different examiner (D) was 54%. In fact, in 20 batches of 50 history
answer books each, only two batches had the same pass percentage the
first and second times they were marked.

The situation was slightly better in history than in Hindi, and
considerably better in biology. But even in mathematics, one examiner differed
as much as 12% with himself, in pass percentage on two markings of the
same answer books. And only four out of 20 pairs, even in geometry,
had the same pass percentage the first and second times.


TABLE 4.2

Frequency Distributions of Differences in “Pass Percentage” between
First and Second Examiners of [...] Sets of Fifty Answer Books Each

In the “Total” column we can see that only 8 out of 80 pairs had
the same pass percentage the first and second time.

The last column accumulates the totals. Thus we can see that 20
of the 80, i.e. one-quarter, differed in pass percentage by 16% or more.
Nearly half of them differed by 8% or more. Of course, with large groups,
the differences tend somewhat to “average out”. Still, it is evident that
not much reliance can be placed on pass percentage.


4. Can we trust Class or Division?

Errors in exact marks would not be so serious if we could at least trust
the broad Classes or Divisions. If a “First Divisioner” is really a First
Divisioner, then there may be some predictability in the academic world.

Several studies done in India have correlated subject marks at various
successive stages of education, longitudinally. In general, the
correlations have been low, indicating only a minimum of stability over time.
The present study approaches the problem from the angle of having the
same answer books marked twice.

In the previous chapter we saw that when a large number of exami- 
ners mark the same answer book, there is a wide spread of Divisions. 
In some ways it is perhaps more realistic to consider only a pair of exami- 
ners, since double-marking actually could be carried out routinely. 

Suppose a candidate is awarded a particular Division. What Division 
is he likely to be awarded if the same, or a different, examiner marks his 
answer book a second time ? 

Table 4.3 answers this question. For this table, we have pooled the
results of both the 500 answer books re-examined by the original examiner,
and the 500 re-examined by a different examiner. (These will be reported
separately later.) Pooled psychological data, particularly correlational
data, are frequently suspect. In this case, however, the pooling is clearly
justified. Examiner A's Third Divisioners may be awarded a Second Class
by Examiner B, while Examiner C's First may be awarded a Third by
Examiner D. Nevertheless, it is quite possible that by chance Examiner
A's candidates might have been re-examined by Examiner D, and
Examiner B's by Examiner C. Thus we are justified in placing them together
in the same table.

Table 4.3 is easy to read. Look, for example, at the row marked
“I” in the history chart. In the first column, headed “Number”, we find
that 8 students were awarded a First Class or First Division on the first
marking. Out of these, only 1 was awarded a First in the second marking,
5 were awarded Seconds, and 2 Thirds. Or we can read the same
chart downward. Here it is the first row (in column “I”) that tells us that
15 candidates were awarded “I” by this examiner. Of these only 1 was
given the same mark by the other examiner, 10 were awarded a Second,
and 4 a Third. (We have not labelled the two sides “first” and “second”
examiners, as they could easily have been interchanged.)

To make the charts simpler to read, boxes have been put around the
squares which represent agreement between the two examiners. At the
bottom you will also find the percentage of the total group who were
awarded the same Division or Class by both examiners. This percentage
ranges from 59.9% for history to 77.1% for geometry. So even in
mathematics, nearly one student out of every four would be awarded a different
Class or Division if he were examined a second time.
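
The "same Division" percentage is simply the share of candidates lying in the boxed diagonal cells of a scatter chart. The sketch below recomputes it for the History chart of Table 4.3. The variable names are hypothetical, and one cell is an assumption: the Third/Third count (334) is not legible in the table as it survives and is inferred here from the row total of 618.

```python
# Rows: Division awarded by one examiner; columns: Division awarded by the other.
# Both axes are ordered I, II, III, Fail. Counts are from the History chart of
# Table 4.3; the III/III cell (334) is inferred from the row total, an assumption.
history = [
    [1, 5, 2, 0],
    [10, 62, 40, 8],
    [4, 109, 334, 171],
    [0, 2, 50, 202],
]

total = sum(sum(row) for row in history)
# Diagonal cells are the candidates given the same Division by both examiners.
agreed = sum(history[i][i] for i in range(len(history)))
same_division_percent = 100.0 * agreed / total

# Marginal totals, useful as a check against the "Number" row and column.
row_totals = [sum(row) for row in history]
column_totals = [sum(col) for col in zip(*history)]
```

The marginal totals reproduce the "Number" entries of the chart (8, 120, 618, 254 across rows; 15, 178, 426, 381 down columns), which is some reassurance that the inferred cell is consistent with the printed figures.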

A critic who saw these charts suggested that we got only two-thirds
agreement on Class or Division because it is easy for candidates on the
borderline to go up and down one point. A glance at the table of
differences (Table 4.1) will show that it was not just changes of one point that
produced these results. Furthermore, anyone familiar with objective
test reliability scatter charts will know that agreements above 90% can
easily be achieved with reliable examinations.

Table 4.3 shows us that, taking all thousand candidates, even in
geometry an average of one quarter of the candidates would NOT be awarded
the same Division by a second examiner. For other subjects the average
number changed is more than one-third.

Taking all four papers (4,000 candidates in all), Table 4.3 shows that,
on the average, candidates are likely to receive the same Division from
two examiners only two-thirds of the time. One and one-third per cent
of them (13 out of every 1000) would have two marks two Divisions apart,
if their answer books were examined a second time. Note that even in
mathematics, there were eleven candidates who were shifted from a First
to a Third or a Third to a First, a Second to a Fail or a Fail to a Second.
Fortunately, none of the candidates was shifted three divisions. Thus we
may state that if a candidate is awarded a First by one examiner, we
can at least be sure he is not likely to be Failed by another examiner.
But a Second may be a potential Failure, and vice versa, in at least one
candidate out of every one hundred.

Table 4.3 shows that even the meaning of “passing” and “failing”
is uncertain:

62% of the candidates were passed in both markings
22% of the candidates were failed in both markings
16% of the candidates were passed once and failed once (or vice
    versa) in the two markings
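
These three percentages also fix how often a "fail" verdict is reversed by the other marking. The sketch below works through the arithmetic under one stated assumption about the counting: every "fail" verdict in either marking is counted, so a candidate failed twice contributes two verdicts and a candidate failed once contributes one. The variable names are illustrative.

```python
# Percentages of all candidates, as listed above.
passed_both = 62
failed_both = 22
passed_once_failed_once = 16

# Count every "fail" verdict handed out across the two markings:
# two per candidate failed twice, one per candidate failed once.
fail_verdicts = 2 * failed_both + passed_once_failed_once

# A fail verdict is reversed when the other marking passed the same candidate.
reversed_share = passed_once_failed_once / fail_verdicts
reversed_percent = round(100 * reversed_share)
```

On this counting, 16 of every 60 fail verdicts (about 27%) are contradicted by the other marking.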

Twenty-seven per cent of all those failed in either marking were
passed in the other marking. (This is based on the total failed, and not just
those who crossed the borderline.)

Students of educational statistics will be interested in this. The
over-all coefficient of correlation* for the Mathematics scatter chart
shown here is +.93. Thus we have a high examiner reliability, utterly
dissipated by the use of a scale with a standard deviation twice that of
any other examination. The fault, of course, is not in the use of the entire
TABLE 4.3

Scatter Charts of the Classes or Divisions Awarded by Pairs of Examiners
(both Same and Different) to One Thousand Answer Books in Each of
Four Subjects


History

              Number   Dist.      I      II     III    Fail
Number          1000      -      15     178     426     381
Distinction        -      -       -       -       -       -
I                  8      -      [1]      5       2       -
II               120      -      10     [62]     40       8
III              618      -       4     109    [334]    171
Fail             254      -       -       2      50    [202]

Same Division = 59.9%

(Bracketed cells are the boxed cells in which the two examiners agree.)


Hindi

Biology

pass mark in most subjects, 70% or 75% is required for a pass in
mathematics). (2) Alternatively, statistical methods can be used to equate
or calibrate mathematics marks to the more usual scale. (See chapter
on Scaling.)

The remaining charts may be examined by the interested reader.
On the average, one-third of the candidates were placed in different Classes
or Divisions in the two markings. It is obvious that we cannot place a
very high faith even in Class or Division.

5. What are the chances of a change in Division if an answer book is
marked again?

There is another way of looking at the data in Table 4.3. What the
student (and his teacher, who is not his examiner) wants to know, of
course, is this: “Suppose my answer book were marked again, what
are my chances of having my class or division raised or lowered?”
Figure 4.1 answers this question.

To get a reasonably stable answer, we used the pooled data for 1000
answer books in each subject from Table 4.3. But since there is really no
logical reason why one examiner should be considered the “first” marker
and another the “second” (logically, they might as easily be
reversed), we averaged the results of the two.*

Figure 4.1 gives the results of this study. For example, look at
History.

The first graph shows that there were 12 First Divisioners in history.
When these were marked again, only 9% of them remained Firsts. In
other words, only 9% of these First Divisioners were awarded a First
by both examiners. Sixty-five per cent were lowered to a Second Division
by one of their two examiners, and 26% were awarded a Third.
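
The averaging of the two examiners' results can be reproduced from the History chart of Table 4.3. The sketch below is one reading of that procedure, with a hypothetical function name: the Division I row (8 Firsts from one examiner) and the Division I column (15 Firsts from the other) are added cell by cell, and each pooled cell is expressed as a percentage of the pooled total.

```python
def pooled_percentages(row_counts, column_counts):
    """Pool one Division's row and column from a scatter chart.

    Returns the average number of candidates in that Division and the
    rounded percentage falling in each Division on the other marking.
    """
    pooled = [r + c for r, c in zip(row_counts, column_counts)]
    total = sum(pooled)
    return total / 2, [round(100 * n / total) for n in pooled]


# History, Division I; cells ordered I, II, III, Fail (from Table 4.3).
as_row = [1, 5, 2, 0]      # the 8 Firsts awarded by one examiner
as_column = [1, 10, 4, 0]  # the 15 Firsts awarded by the other

average_number, percents = pooled_percentages(as_row, as_column)
```

The average number is 11.5, which the text rounds to 12, and the percentages come out as 9, 65 and 26 for First, Second and Third, matching the figures quoted for the first graph.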

It is quite apparent that a “First” in History usually means only
that one particular examiner is pleased, not that there is any inherent
merit in the answer book that would be recognized by others also.

The second graph shows that out of 149 Second Divisioners marked
again, 5% went up to a First and 3% were Failed. And even of the

* Take for example the graph for geometry First Divisioners. From Table 4.3
we find that one examiner awarded 143 Firsts while the other awarded 139 Firsts.
They are averaged as follows:

Number   Dist.      I     II    III
   143     15      94     30      4
   139     29      94     15      1
   282     44     188     45      5
  100%    16%     66%    16%     2%

The average number of First Divisioners is 282/2 = 141.

FIG. 4.1 (contd.)

[...] more reliable than the others. They are not. There is nearly twice as
wide a range in the Fail category as in First, Second, or Third. Thus a Third
Divisioner's mark [...] changed by only 7 (out of 50) to put him in another
[...] Failure's mark can be changed as much as 16 points and still [...]

[Figure 4.1 panel residue. Subjects: History, Hindi, Biology, Geometry.
Surviving values: I, 9%; II, 42%; III, 64%, 64%; Fail, 63%, 67%, 63%.]

[...] stand: a candidate's “Division” is as much a matter of the chance of who
marks the answer book, as it is a matter of the inherent ability of the
candidate.

Let us suggest, however, that students look rather carefully at Figure
4.1 before they demand that their answer books be re-examined. Except
for Failed students, in every other category (with the single exception of
biology Third Divisioners), there is more chance that the mark will be
lowered than that it will be raised, on re-examination. (Statisticians will
recognize this as a good example of the familiar principle of “regression
towards the mean”.) Thus re-marking, if it is ever permitted, will help
the low-mark students only, and the student who hopes to “improve
his division” is not likely to benefit.

A closer look at the Failures is in order. Notice that 37% of the 
Failures in history and biology, and 33% of the Failures in Hindi were 
passed by a second examiner. (Only 10% of the geometry Fails were 
changed, but as we have seen this is partly an artifact of the extremely 
flat marking curve.) Educators may well ponder the value of an 
examination system where one-third of all “ Failures ” in a subject would 
be passed on re-examination. 


6. Six examiners compared on equivalent answer books

There may be a suspicion that some of the results reported are due
to the fact that different examiners received different batches of answer
books. Perhaps the batch influences the examiner as much as the
examiner influences the batch. Suppose, as in the ninety examiner project,
all examiners had marked the same batch of answer books?


TABLE 4.4

Largest Difference between Marks of One Highly Experienced Examiner (X)
and Six Other Experienced Examiners

Examiner   History      Hindi        Biology      Geometry
           (50 marks)   (34 marks)   (50 marks)   (50 marks)
E             10
F             10
G              7
H             16
I              8
J             11

[Remaining values survive only as fragments whose columns cannot be
recovered: 9, 8, 9, 8, 14, 14.]

[...] particular study. However, in one part of this study we drew (in
each subject) 300 answer books from one especially experienced examiner
(called Examiner X). These answer books were divided into six equal
batches. The average marks in each batch were exactly the same. Each
batch of 50 answer books was sent to a different examiner (Examiners E
to J, but a different six for each subject, of course) along with 50 of his
own, making a total of 100.

For each of the subjects we took 300 answer books which had been marked by

Figure 4.2 shows the results. Start with the results of the seven
history Examiners. The top graph (shaded) shows the marks of the
original specially experienced examiner, Examiner X, for all 300 answer
books. The line in the middle shows his mean or average marks. The
length of the graph line gives an indication of his distribution of marks.
(This is one standard deviation on each side of the mean.)



FIG. 4.3

[Caption largely illegible. Surviving fragments refer to four thousand
Higher Secondary Board Examination answer books of 1964, and to Hindi
marks from 7 to 34 having been multiplied by 50/34.]

(first) marking, for the four subjects. And for those not used to
interpreting cumulative frequency curves, Figure 4.4 graphs the four distributions
of marks. In both figures, the lowest mark of each Class or Division is
shown (17, 23, 30, 38).

It is interesting to see how very different the marking system seems
to be, for the various subjects. How else can we explain the very
different distributions of marks? In this study, the lack of a completely random
sampling may have contributed to this. But other studies have shown
the same differences. Some may say, of course, that the different
distributions represent differences in the abilities of the students, not different
marking systems. Although it is probably impossible to prove definitely
that this is not so, it is certainly highly unlikely. Harper (1963b), for
example, showed that students with the same ability received different ranges
of marks in biology and mathematics. Mahalanobis (1934), and others,
have shown widely different distributions in different subjects for the
same sample of students. The Education Commission (1966, page [...])
noted very pointedly that “an 80 per cent mark, say, in mathematics
does not convey the same meaning as, say, 80 per cent mark in history
[...]”.

This would not matter if we treated each [...] marks separately,
and did not add them. However, when marks are added, the distribution
[...] 1963, also emphasized this point. [...]
marks seems to be that each subject [...] equal weight [...]
that the weight of the subject in the aggregate total [is] determined [by the]
number of marks allotted to it [...] for example mathematics [...]
marks are added. The result is that [...] mathematics [...] may actually
be twice or three [times as] influential in determining
the candidate's final rank as are other subjects which are supposed to be
equally important.

It is also noteworthy that these [...] obviously not “normal”.
Both theoretical and factual evidence [...] that abilities are normally
distributed in any average group. [...] selection of students
that has taken place (the weaker [...] before reaching
high school exams) can explain a [...] in lower scores (negative
skewness). But there is no such way [...] the flattening (platykurtic
nature) of the curves. (Yes, this [...] if you average out the
artificial abnormalities around pass [...]. [...] of course,
especially striking for geometry.) [...] that the methods
they use to determine marks are [...] that any statistical scaling
procedure would be imposing a [...] scale on the natural [...].
The above facts, however, prove exactly the opposite. The methods
now being used to determine marks (particularly in mathematics) are
actually quite artificial and arbitrary. Furthermore, they take no account
of all we now know about the distribution of human abilities and
achievements. This situation can only be corrected by the application of
scientific scaling techniques to examination marks.

Note what Dr Taylor calls “the J-effect”. This is the piling up of
“just barely passing” marks. You can see in the graphs what the late
Dr Habibul Rahman, in his usual delightful way, referred to as “the
Qutub Minar” effect. In history, for example, 237 candidates are
“awarded” (and this is the only appropriate word, since this obviously
is in no rational sense a measurement) a mark of 17 (passing); none are
awarded 16 (which would be just barely failing). This is a sort of
educational “brinkmanship” for which there is very little scientific or
rational justification. At best, it is an obvious confession on the part of
the examiner that he knows his results are fallible, and he does not
want anyone to fail “by chance”. Yet, if this is the motive (and it is
certainly a laudable motive), there are far better ways to ensure this result.
(See, for example, Taylor's 1964 discussion of “Grace Marks”, which is
summarized in Appendix B of the present volume.) Actually, by trying
to adjust just the borderline marks, the examiner takes care of only a
small fraction of the total range of “failure by chance” marks, anyway.
Furthermore, a simple knowledge of correlation between examinations
(generally low) and what that implies would indicate that these tactics
are not likely to be effective in many cases, anyway. How much better
to be honest, and eliminate this “J-effect” in marking.

The fact that examiners would probably like to be more honest
is indicated in Table 4.5 by the second column, R, of each pair. Notice
how greatly reduced the J-effect is, when the examiner marks the exam
books the second time. Knowing that the marks “will not really count”,
he apparently feels freer to be completely honest in marking. The number
for whom he feels justified in adding an extra mark “to keep them from
failing” is much smaller under this circumstance. Yet even in these
instances, the unadmitted assumption that 17 is “pass” remains.
[...] unbiased marking, apparently, one must cease to
think in terms of “pass” or “fail”.

[...] problem is simple: (1) Introduce the scaling*
of examination marks [...] and (2) completely outlaw the J-effect in
marking [...] that the examiner's “17” is no
[...] piled up at the [...] an examiner's marks
[...] books should be returned
[...] marking. It would be simple to send only these, asking

* See the chapter on Scaling in this book.

[...] that human examiners must be following a different system. The larger
group of 194 must be passing, not failing. This is why the lines across,
between Divisions, are one mark lower in the Hindi columns than in the
others. However, in order to make the graphs (Figures 4.3 and 4.4)
visually comparable for all subjects, we raised the Hindi marks between
7 and 34 (inclusive) by one point each.


8. Reliability

What we have discussed so far is just the difference between two
examiners marking the same essay-type examination. But suppose the
candidate also takes a different examination, even one which is supposed
to measure exactly the same subject-matter as the first examination?
In other words, suppose he takes two alternative examinations, each
marked by a different examiner? How different will his marks be?

The over-all summary of this study is contained in Figures 4.5
and 4.6. Figure 4.5, the coefficient of reliability, will be readily understood
by the technically trained. But Figure 4.6, the Standard Error of
Measurement, is in many ways more meaningful, and is certainly easier to explain
to the layman.


There are errors in all measurements. (Try asking ten men to measure
the length of the same table. You will get ten different answers.) By
certain mathematical formulas we can describe this “margin of error”
for an examination. This figure is known as the standard error of
measurement (SEM).


Suppose that a student was examined 100 times, on a hundred
different but equivalent papers, and marked by a different examiner
each time. (We can't do this actually, of course: he would go on strike!)
We then might think of the average of these 100 different marks as the
true mark of the student. But of course, some of his marks would be
above this true mark, and some below it. Two standard errors of
measurement (one above and one below the true mark) give the range within
which about 68% of his marks will lie. Four standard errors of
measurement (plus or minus two) give the range of about 95% of the marks.
[...] standard errors of measurement, virtually all of
the candidate's possible marks will lie within this range.
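
These ranges can be made concrete with a short sketch. Two things here are assumptions rather than statements from the report: the formula used for the SEM is the standard textbook one (standard deviation times the square root of one minus the reliability coefficient), which may or may not be the formula the author had in mind, and the true mark of 25, standard deviation of 10 and reliability of .91 are hypothetical values chosen for round numbers.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Textbook SEM: sd * sqrt(1 - reliability). An assumed formula, see above."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)


def sem_ranges(true_mark, sem, max_mark=50):
    """Mark ranges covered by +/-1, +/-2 and +/-3 SEM, clipped to the 0..max_mark scale."""
    ranges = {}
    for k, share in ((1, "about 68%"), (2, "about 95%"), (3, "virtually all")):
        low = max(0.0, true_mark - k * sem)
        high = min(float(max_mark), true_mark + k * sem)
        ranges[share] = (low, high)
    return ranges


# Hypothetical candidate and examination; none of these values come from the study.
example_sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
bands = sem_ranges(true_mark=25, sem=3.0)
```

With an SEM of 3 marks, a candidate whose true mark is 25 would be expected to receive between 22 and 28 about two times in three, and between 19 and 31 about nineteen times in twenty. On the Division boundaries used in this chapter (17, 23, 30, 38) that second range runs from a Third to a First.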

[...] little differently. If 95% of the marks lie within
a range of plus or minus two standard errors of measurement, we can
also say [...] that the student's mark will be
more than two standard errors of measurement above or below his true
mark. From Figure 4.6 we find the average standard error of
measurement of geometry [figure illegible, possibly 5.4] [...]
is a 5% chance for any candidate that
his mark on a [...]

From Figure 4.6 we see [that] the situation for history, Hindi, and
biology is not quite so bad. Here the [...] range from 3.2 to 3.4 marks
[...] 27 marks out of 50 [...]
Division [...]

To make this clear, the data are presented in a graphic fashion
in Figure 4.7. Here two ranges are plotted: plus and minus one standard
error of [...] marks) and
[...] described at the bottom [...]

The average SEM of each [...]
of the graph is the range of a [...]
essay-type examination (the average of history, [...])
[...] 50 marks used [...] has already [...]
other examinations [...] subjects? [...] different scale [...]

The reason is that examiners [...] from the scale used by [...]
(Harper, 1963b; Mahalanobis, 1934). Comparing mathematics and [...]
marks is like comparing centimetres [...]. The marking [...]
mathematics has a much smaller [...] than other examinations [...]
mathematics is apparently [...] than a mark in any other
subject. Therefore, a much larger number of them [...] describe
a student of high [...] “awarded” to a long [...] (see Figure 4.4 or
Table 4.5) [...] as being equal to other [...] (Classes or Divisions) the
[...] a much greater [...] of Distinctions
and of Failures than is really justified. And the result, also, is that the
SEM is unnecessarily enlarged and the reliability (in the layman's sense)
of marks is lowered.

Key: 68% of candidates; ninety-nine per cent of candidates included in
this range.

FIG. 4.7

Relationship of error in examinations to the total range of marks out of 50,
based on average Standard Error of Measurement of each examination

If, as has been widely advocated but seldom done in India,
all marks are “scaled” before they are published, then the advantages of
better marking in mathematics will show up in a smaller standard error
of measurement. As a matter of fact, if this were done the range of error
in Figure 4.7 would be only slightly larger for mathematics than for the
[...] examination. By “scaling” we mean that all marks are changed,
statistically, into the same scale, rather than there being different
[...] different types of subjects (Mahalanobis, 1934; Gulliksen, 1960).
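
One simple way to put marks from different subjects on the same scale is a linear transformation to a common mean and standard deviation. The sketch below illustrates that idea only; it is not the scaling procedure recommended in the chapter on Scaling, which may well be different (for example, a non-linear normalization). The function name, the target mean of 25, the target standard deviation of 8 and the two sets of raw marks are all hypothetical.

```python
from statistics import mean, pstdev


def scale_marks(raw_marks, target_mean=25.0, target_sd=8.0):
    """Linearly rescale marks so that they have the target mean and standard deviation."""
    m = mean(raw_marks)
    sd = pstdev(raw_marks)
    if sd == 0:
        raise ValueError("cannot scale marks that are all identical")
    return [target_mean + target_sd * (x - m) / sd for x in raw_marks]


# Hypothetical raw marks out of 50 for the same five candidates.
history_raw = [17, 20, 23, 26, 29]      # narrow spread
mathematics_raw = [5, 15, 25, 35, 45]   # wide, flat spread

history_scaled = scale_marks(history_raw)
mathematics_scaled = scale_marks(mathematics_raw)
```

Before scaling, the mathematics marks would dominate any aggregate because their spread is more than three times as wide. After scaling, both subjects have the same mean and spread, so each carries equal weight when the marks are added.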

[A] word about the reliabilities in Figure 4.5. The reliability
[...] here is marks reliability. You will find later (Chapter [...])
[...] discussion of this concept. Briefly, we have
three types of [...]: reader reliability (the correlation between
two examiners marking the same answer book), content reliability
(analogous, in an essay test, to “reliability” in an objective test),
and marks reliability. If a student takes Form A, which is marked by
Examiner X, and then strictly parallel Form B, which is marked by
Examiner Y, the correlation between the two sets of marks will be the marks
reliability of the test. As can easily be seen, this is the only [...]
reaching this standard [...]
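
Marks reliability, as described here, is a correlation between two sets of marks for the same candidates. The sketch below computes the ordinary product-moment (Pearson) correlation; the report does not say which coefficient was used, so that choice is an assumption, and the marks for Form A and Form B are invented.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of marks."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, with at least two marks")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return sxy / sqrt(sxx * syy)


# Invented marks (out of 50) for five candidates on two parallel forms,
# each form marked by a different examiner.
form_a_examiner_x = [12, 18, 25, 31, 40]
form_b_examiner_y = [15, 16, 28, 27, 38]

marks_reliability = pearson_r(form_a_examiner_x, form_b_examiner_y)
```

Because both the paper and the examiner change between the two sets of marks, this coefficient absorbs both sources of error, which is why it is normally lower than reader reliability alone.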

The conclusion is painfully obvious. The reliability of [...]
examinations, even in mathematics, is [...]
[...] of differences in the abilities of [...]. This is not a new
discovery. Since it has been known [...], one wonders
why we continue to bless or curse students for life on the basis of results
which are often only a little [...]

However, we are not yet finished with Figures 4.5 and 4.6. In each
of these, a circled “90” [...] marking ten”
experiment. It is particularly [...] that the
reliability for that study is so very much higher than any other history
reliability figure (.71, while the [...]). [...] undoubtedly
due to the fact that in that study, [...] answer [...]
which the same five questions [...] answered. In the present
study, however, different candidates were, as usual, answering
different questions. All experts contend that examinees should be
allowed no choice of questions. (The only exception is when questions
are set on different alternative textbooks, in which case the choice should
be “either-or”.) The comparison of the results of our two studies
dramatically confirms the large loss of reliability when students are allowed to
select any five out of ten questions.

[Footnote fragment:] “Making the assumption that [...] a single group with
[...] (b) To evaluate differences in level of [...] .84
(c) To evaluate level of individual accomplishment [...]
It must be recognized [...] would be reasonable to [...]
the above assumptions as [...] individuals and groups [...] practical values which [...]

[Footnote fragment:] [...] humour in this slip [...]
been “exterminated” before they [...] chance to flower by [...]
traditional essay-type [...] “failure” in mathematics [...]
his lowest marks in the Public Service examination [...]

The SEM is, of course, related to the marking scale used. The
marking scale used by the 90 examiners seems to be somewhat wider than
that of most of the 10 examiners. (Possibly this is related to differences
in marking ten vs. 100 answer books.) Thus the SEM for the "90 marking
10" study is slightly larger than in the present study. Also note one
very small SEM, which was due to one examiner awarding only an
absurdly narrow range of marks.

Finally, remember what was pointed out earlier (Chapter III, Section 7)
about the standard error of measurement. What we have graphed
is the SEM for marks on each examiner's own particular scale. These
would hold for another examiner using the same scale. But examiners
use different scales, with different means and different standard deviations.
Thus the actual errors, as we have seen from Table 4.1, are much
larger than these SEMs would lead us to believe.
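
The link between the SEM and the marking scale can be illustrated with the usual textbook formula, SEM = SD x sqrt(1 - r). The sketch below is only an illustration under that assumption: the function name and the two standard deviations are hypothetical and are not figures from this study.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), the usual textbook formula."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

# Two hypothetical examiners with the same reliability but different
# spreads of marks: the wider scale gives the larger SEM.
narrow_scale = standard_error_of_measurement(sd=5.0, reliability=0.75)
wide_scale = standard_error_of_measurement(sd=15.0, reliability=0.75)
print(narrow_scale, wide_scale)
```

Because each SEM is expressed on its own examiner's scale, two such values cannot be compared directly until the marks have been brought to a common scale.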


9. Is the situation hopeless?

From the above it is obvious that the reliability of our traditional
essay-type examinations is completely inadequate for the purpose for
which we use the marks, i.e. to decide individual merit. Does this mean
that all examination results are totally unreliable?

Not quite. Fortunately, as we add together the marks of correlated
papers, the reliability of the aggregate marks tends to build up. This
is because some of the errors are averaged out. The candidate who happens
to have had a very stiff examiner on one paper may, by chance, have a
lenient examiner on another paper. Some, of course, will have a stiff
examiner (or a lenient examiner) on both papers. So the errors
do not completely average out, by any means, and for some candidates
the errors may not average out at all. However, at least the total number
of candidates who are seriously mis-marked is reduced when we take the
total aggregate marks, ignoring marks on individual papers.
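
The averaging-out of examiner errors can be sketched with a small simulation. It rests on assumptions that are not taken from the study: each paper is marked by an independently chosen examiner, examiner error is normally distributed, and the error spread of 8 marks is a hypothetical figure.

```python
import random
import statistics

def mean_examiner_error(n_papers: int, error_sd: float, rng: random.Random) -> float:
    """Average marking error over n_papers, each with an independent examiner."""
    return statistics.fmean(rng.gauss(0.0, error_sd) for _ in range(n_papers))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
ERROR_SD = 8.0           # hypothetical spread of examiner error, in marks
N_CANDIDATES = 5000

# Error per paper when one paper is taken alone, and when five are averaged.
single = [mean_examiner_error(1, ERROR_SD, rng) for _ in range(N_CANDIDATES)]
five = [mean_examiner_error(5, ERROR_SD, rng) for _ in range(N_CANDIDATES)]

spread_single = statistics.pstdev(single)
spread_five = statistics.pstdev(five)
print(round(spread_single, 2), round(spread_five, 2))
```

Under these assumptions the spread of the averaged error shrinks roughly with the square root of the number of papers. Where the same stiff or lenient examiner marks several papers the errors are no longer independent, and the gain is smaller.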

Unfortunately, as we have mentioned before, there is no "averaging
out" in the case of a failure. In this case, the chain is only as strong
as its weakest link. At the crucial level, the decision as to whether a
candidate passes or fails, the aggregate results are just as unreliable as
those of a single subject. As Taylor (1964) has shown, our present
unscientific system of "grace marks" does not adequately correct for these
errors. (See Appendix B.)


10. Are examiners influenced by the number of pages written?

Many students are [...] of pages and award marks accordingly. [...]
this is not true. It is obvious that [a student who] knows a great deal
will write more pages than [one who] knows almost nothing,
so some relationship between examination marks and number of pages
written is to be expected. And this is all we found: just a very minimum
relationship. In all but two of [...] the maximum correlation was .45
[...] means that the influence of number of pages on marks was only
[...] of the maximum possible influence. (In technical terms, this was
the overlap of the variances.) The average correlation was .3[...],
[...] only 14% of the maximum possible influence. So if the [...] study
were at all representative, we can say that less than [...] examiners
[are] influenced significantly by [the number of pages]. Students should
be informed of this. It would [...] examiners if students would
stop filling their answer books with so much "padding" and trash.
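
The "overlap of the variances" is the square of the correlation coefficient. The sketch below shows the arithmetic only. The two correlations are illustrative values in the range this passage discusses, not a restatement of the study's tables, and the function name is hypothetical.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two measures have in common: r squared."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r

# Illustrative correlations between marks and number of pages written.
for r in (0.37, 0.45):
    print(f"r = {r:.2f} -> about {shared_variance(r):.0%} of the variance overlaps")
```

A correlation that looks moderate therefore corresponds to a much smaller share of common variance, which is the sense in which the influence of the number of pages is described as minimal.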


11. The need for scaling

The reader has already found that a recurrent theme of this book is
the need for scaling examination marks. This has been pointed out in
earlier studies also, repeatedly and [...]

[...] of temperature. Suppose one thermometer reads 104 degrees and the
other reads 40 degrees. [...] average them and [...] that 72 degrees is the
average temperature [...] they both read in "degrees" [...]
"degrees" is first to convert one scale to the other. Here is the formula:

(9/5) C + 32 = F

Thus, in the problem above, (9/5)(40) + 32 = 104, so the two
thermometers are actually reading the same temperature. Similar formulas in
the form ax + b can be set up to translate the scale of any examination
(or any examiner) to the scale of another examination (or examiner).
More commonly, we translate the scales of all examinations and
examiners to some arbitrary scale which we have accepted as the
standard. This scale may be based on the average of the scales of all
examiners and subjects, or it may be based on some other type
of arbitrary decision.
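
A translation of the form ax + b can be sketched as follows. The thermometer function follows the formula in the text. The marks example assumes one common linear method, matching the mean and standard deviation of the raw marks to a chosen standard scale, which is not necessarily the method recommended later in this book. The raw marks and the standard scale (mean 50, SD 10) are hypothetical.

```python
import statistics

def celsius_to_fahrenheit(c: float) -> float:
    """The thermometer formula from the text: F = (9/5) C + 32."""
    return c * 9.0 / 5.0 + 32.0

def linear_scaler(marks, target_mean, target_sd):
    """Return a, b so that a*x + b maps `marks` onto the target mean and SD."""
    a = target_sd / statistics.pstdev(marks)
    b = target_mean - a * statistics.fmean(marks)
    return a, b

# Hypothetical raw marks from one stiff examiner, translated to an
# arbitrary standard scale with mean 50 and standard deviation 10.
raw = [22, 30, 35, 41, 47]
a, b = linear_scaler(raw, target_mean=50.0, target_sd=10.0)
scaled = [a * x + b for x in raw]

print(celsius_to_fahrenheit(40))
print([round(s, 1) for s in scaled])
```

The rank order of the candidates is unchanged by such a translation. Only the mean and spread of the marks are brought into line with the standard.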

Any student of educational measurement knows one or two scaling
methods. Others have already been published in India (Mahalanobis,
1934; Harper, 1963 a). A fuller discussion of scaling, its rationale, and
various methods especially suited to public examinations is presented in
a later chapter entitled "Scaling". It is hoped that, even though it contains
some technical details, every reader concerned with the final results
of examinations will read it. (It is not necessary to read the technical
chapters in between to understand the chapter on Scaling.)



PART TWO 


Detailed Reports 
Review of the Literature*


THE PROCESS of education includes three major divisions:
formulation of objectives, designing education for the achievement
of the objectives, and assessing the outcomes of education.
The function of examinations is to provide such information as is needed
for a precise assessment of the outcomes of education. As such,
examinations are as old as, or rather older than, any system of education.
Examinations are of many forms. They may be performance,
oral or written. Written examinations may be classified as objective-type,
very-short-answer type, short-answer type and traditional essay-type.

The genesis of traditional essay-type tests may be traced to the
Chinese system of administration in vogue around 1115 B.C. The Chinese
used the essay-type tests for the selection of candidates for Government
offices. Each candidate was shut up in his narrow cell for days and nights
and was required to produce compositions in prose and verse on the themes
assigned to him. (A detailed description of the Chinese system of testing
is given by DuBois, 1964.) It was probably because of lengthy testing time
and vast coverage of test material that, as Ebel et al. (1958, p. 1502)
put it, "the competitive examination system of ancient China helped
it serve its primary purpose of providing men of ability for the service
of the state." The Chinese system of testing travelled, though not in
that strenuous form, from China to the West. By the 19th century, the
essay-type tests in the traditional form were well-established in western
countries for the award of degrees and diplomas. (A concise narration
of the early history of examinations is given by Ebel et al., 1958.)

* This chapter is by the junior author and is adapted from his unpublished
manuscript, A Critical Study of Essay-Type Examinations, a thesis presented
for the degree of Doctor of [...]

The use of the essay-type tests in India came as an inevitable con- 
sequence of the introduction of the western system of education. Indian 
universities were modelled on the lines of the British universities; 
therefore, they also adopted the examination system in practice in the 
U.K. In India, the essay-type tests were used on a large scale for the 
first time in 1857 by the universities at Calcutta, Madras and Bombay 
for the selection of students, and later for the awarding of university
degrees. The wide use of the essay-type tests at the university stage
gradually led to the adoption of such tests at the lower stages of 
education also. 


Over the years examinations have attained too great an importance 
in the West as well as in India. People have been branded for life on 
the basis of their examination results. Not only a man’s academic 
but also his social status has often been judged from his degrees and dip- 
lomas. This has led many to seriously question the accuracy of exami- 
nations. One of the earliest studies exploring the efficiency of the essay 
examinations was made as early as 1888 by Edgeworth (1888) in the U.K. 
Other studies followed suit, of which those done by Starch and Elliot 
(1913) in the U.S.A., and Hartog and Rhodes (1935) in the U.K., are
very important. Thus in the early half of the 20th century, systematic 
research on examinations was well started. 

Advances in various branches of sciences helped examination
research a great deal. Examination experts no longer remained satisfied
with guesses. The scientific research approach came to be adopted for
the study of almost every problem in examinations. In the beginning,
attempts were made to minimise subjectivity in marking. Many kinds
of objective-type tests were worked out. Various scoring methods were
[...] chance factors in scoring. Different statistical concepts
[...] validity, difficulty and discrimination indices were
developed.

Methods of interpretation of examination results also underwent a change.
[...] evolved for interpreting marks. Test standardisation and [...]
of norms completely revolutionised the examination movement. A new
discipline known as "Psychometry" (by the British) or "Psychometrics"
(by the Americans) was developed, which came to be regarded [...]
puts it, "both a science (based on rigorous statistical and [...]
methods) and an art (characterised, at its best, by a high [...],
creative ingenuity, and insights into human behaviour)."

[...] Thus the same experiments were often repeated at different
places. This was probably because of a general feeling that
since teaching and learning conditions, which vary from place to place,
have much bearing on examinations, research findings of one place may
not be universally valid. The following statement of the Examination
Committee (1962, p. 5) alludes to the above feeling:

Methods of testing should be related to practices of teaching and
conditions of study, and should be capable of adjustment and
adaptation to particular situations and needs.

The same feeling is voiced by Dave (1966, p. 9):

Successive commissions, committees and study groups on education
have pointed out the evils of this examination system. The remedies
to all these evils cannot be found in the researches done
elsewhere in the world, and only careful studies and investigations
can offer solutions to problems vexing the people concerned with
education.


Definition of Terms Used*

The two basic concepts of reliability and validity in examinations are
used throughout this chapter and are understood to mean the following:

RELIABILITY

This "refers to consistency throughout a series of measurements"
(Cronbach, 1960, p. 126). In other words, "The reliability of any set
of measurements is logically defined as the proportion of their variance
that is true variance" (Guilford, 1956, p. 436).

The problem of reliability of the essay tests may be divided into three
categories:

Reader Reliability

The "correlation between the marks of two readers is known as
'the reader reliability' of the examination" (Gulliksen, 1950, p. 211).

Content Reliability

"The reliability of an essay test corrected for attenuation due to the
inaccuracy of reading has been termed the content reliability of the essay
test" (Gulliksen, 1950, p. 214).

Total Reliability

This is a term coined by Harper (1966 a, p. 12) and defined by him as
"the correlation between parallel examinations marked by parallel
examiners. That is, it is the correlation between examination 1 marked
by examiner A and examination 2 marked by examiner B. This is the
reliability coefficient with which we should be most concerned in
comparing the reliabilities of essay and objective examinations, or in
practical decisions of the degree of trust to be placed in an examination
mark."

* In this chapter, terms have been used which have currency in the
literature. In the next chapter we will re-define some of these more
critically, for the purposes of the present research study.
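
Reader reliability, as defined above, is simply the product-moment correlation between the marks two readers give to the same scripts. The sketch below computes it for two short lists of marks. The marks are hypothetical, invented only to show the calculation, and the function name is an illustrative choice.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of marks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical marks awarded to the same six scripts by two readers.
reader_a = [35, 42, 50, 58, 61, 70]
reader_b = [30, 45, 41, 60, 52, 66]

reader_reliability = pearson_r(reader_a, reader_b)
print(round(reader_reliability, 2))
```

Total reliability would be computed in the same way, except that the two lists would come from two parallel examinations marked by two different examiners.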


VALIDITY

"A test is valid when it measures well what it is supposed to measure"
(Micheals and Karnes, 1950, p. 104). To be valid a test must be reliable.
But the reverse is not true. A test may be highly reliable, but it may be
invalid for the purpose it is supposed to serve.

The following terms have been used interchangeably in the study:

"reader" and "examiner";

"essay test", "essay examination", "essay-type test" and
"essay-type examination";

"objective test", "objective examination", "objective-type test"
and "objective-type examination";

"score reliability", "test reliability" and "total reliability";

"grading reliability", "marking reliability", "examiner reliability"
and "reader reliability".


Summary of Early Studies 

A comprehensive and scholarly review of the studies on examinations 
done prior to 1958 has been made by Basumallik (1959) Unfortunately, 
the review is marred by a few points, some minor and some major, which 
may mislead the lay reader Without in any way detracting from the 
over-all value of the review, it seems wise at this point to call attention 
to these points 

While introducing the concept of reliability (p. 128) Basumallik
states, "In a general sense reliability means the accuracy with which a
measuring device measures whatever it is expected to measure." (Italics
are the present researcher's.) The definition, as it stands, best fits the
definition of validity. The italicised portion should have been replaced
by something such as "whatever it does measure". Many researchers
would disagree with Basumallik's statement (p. 149) where he says:

It is needless to reiterate that in comparing essay and objective tests
we should not consider reader reliability, for obvious reasons.
Comparison should legitimately be made between test reliabilities of the
two types of examination. The test reliability, for essay examination,
should be the content reliability or the reliability estimated by the
methods of test-retest, analysis of variance, etc. It is desirable to
compare coefficients of a similar nature.




On comparability of essay and objective tests, Gulliksen (1950, p. 212)
observes: "If two parallel essay examinations are matched just as
successfully with respect to content as are two parallel objective
examinations, the correlation between the two parallel essay forms will
practically always be lower than between the two objective forms, owing to
the fact that the unreliability of reading will still further lower the
correlation between the two essay forms." Thus, as Harper (1966 a) holds,
for all practical purposes only total reliability, and not content
reliability, of the essay test should be compared with reliability of the
objective test. Since reader reliability of the objective tests is 1.00
(i.e. the objective examinations completely eliminate this source of
unreliability), ignoring it in the essay examinations presents an
unrealistic picture.

Test reliability is a function of test length. This Basumallik has also
noted in his review (p. 150). Obviously, a 24-hour or so essay test (as the
Chinese seemed to have) would be more reliable than a half-hour objective
test. Hence for comparing reliabilities of the essay and objective tests the
testing time should also be equated. But while giving comparative figures
of the essay and objective test reliabilities (p. 150) Basumallik does not
record testing times. Explaining the comparative figures of the essay and
objective test reliabilities, he (p. 151) states, "The median reliability
coefficients, however, are .85 for essays and .93 for objective tests
respectively." The direct comparison of reliability figures implies that
testing time for all the tests whose reliabilities are under comparison is
the same. But his further discussion indicates that testing time varied for
different tests. It may be pointed out that direct comparison of the
reliabilities of tests of unequal lengths is technically objectionable and
is misleading to laymen.

Basumallik further states (p. 151) that "the difference between the
two (.85 and .93) is only 8, in favour of objective tests." (Obviously 8 is
a printing mistake for .08.) Similarly, at another place on the same page
he states that "These reliabilities range from .80 to .96, with a median
of .90, which is only .05 greater than the median essay reliability stated
above" (i.e. .85). In explaining differences between reliability figures he
ignores a vital point: reliability figures do not represent a linear scale.
It is z coefficients whose scale is linear. Thus he should have converted
the reliability figures to z's, obtained the difference between the two
z's, and again converted the obtained difference of the two z's to r, if he
was interested in finding the difference between the two reliability figures.
The present researcher doubts that even this approach, though statistically
sound, would be able to give any idea about the relative merits of
the two tests. The best, and probably the easiest to understand, approach
for such comparison seems to be the application of the Spearman-Brown
Prophecy formula. This formula tells us that a test of reliability of .85
has to be lengthened 2.34 times to reach the reliability of .93. In other
words, the difference between the reliability figures of .93 and .85 is not
nominal, as Basumallik's inference seems to suggest. This point should
have been made more clear in his review to avoid any misconception
in the minds of non-technical people.
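
Both approaches mentioned above can be sketched in a few lines. The sketch assumes the standard forms of the two formulas: Fisher's z is taken as the inverse hyperbolic tangent of r, and the Spearman-Brown formula for a test lengthened n times is r_n = n r / (1 + (n - 1) r), solved here for n. The function names are illustrative.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's z transformation, z = atanh(r)."""
    return math.atanh(r)

def lengthening_factor(r_now: float, r_target: float) -> float:
    """Spearman-Brown solved for n: how many times longer the test must be."""
    return (r_target * (1.0 - r_now)) / (r_now * (1.0 - r_target))

def stepped_up_reliability(r_now: float, n: float) -> float:
    """Spearman-Brown prophecy formula for a test lengthened n times."""
    return n * r_now / (1.0 + (n - 1.0) * r_now)

z_gap = fisher_z(0.93) - fisher_z(0.85)   # gap on the linear z scale
n = lengthening_factor(0.85, 0.93)
print(round(z_gap, 2), round(n, 2))       # n comes out near 2.34, as in the text
```

Put this way, a gap of .08 between the two coefficients means that the essay test would need to be well over twice as long to match the objective test.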

Despite the above-noted minor negative criticisms it may again be 
emphasised that Basumallik’s is an excellent review and the quality of 
his work may not be looked down upon for some minor lapses here and 
there. The following conclusions are drawn from his review. 

One must say that reliability of a usual essay-type examination,
reader as well as content, is not as high as that of an objective test. This
is true even of subjects like mathematics and chemistry. An objective 
test, even when imperfectly constructed, is not likely to have reliability 
lower than .70, whereas the usual essay test will have still lower reliability. 
It may be a problem for further experimentation to find out how best an 
essay test can be improved. Under certain conditions, e.g. team impres- 
sionistic marking, however, an essay examination can be so improved as 
to stand fair comparison with an objective test in point of test reliability. 
But these conditions are not always feasible to employ in actual situations 
involving a huge number of examinees and requiring quick decision. 
More often we have to depend upon a single marker in a real situation, 
and the marking of a single examiner is found to be considerably errone- 
ous. On the whole, the research on essay reliability, done so far, presents 
incomplete and controversial evidence on many issues. Intensive re- 
search is needed to see how far it is possible to improve the reliability of 
an essay. 

Basumallik has reviewed the earlier studies so well that it did not seem
necessary to repeat this work. The present researcher, therefore, confined
himself to reviewing later studies as well as some important studies
omitted by Basumallik.


Examination Research Abroad 

I Conversion of Essay Items to Objective Items

Some foreign and Indian experts (Vernon, 1940; Wrightstone, Justman
and Robbins, 1956; Thorndike and Hagen, 1961; Examination
Committee, India, 1962; Standards Committee, 1965) hold that
essay-type tests measure those larger outcomes of education which
cannot be measured by other paper-and-pencil-type tests. Others (Sims,
1948; [...], 1951; Noll, 1957) hold that this claim of the essay-type
tests is generally accepted without proof, evidence, or supporting logic.





Implicit in the claim of essay-type tests is that objective tests do not
and cannot measure the larger outcomes of education. Some experts
(Ballard, 1923; Diederich, 1957; Nedelsky, 1957; Engelhart, 1957;
Ryans, 1958; Frederiksen, 1960; Kelly, 1963; Harper, 1963 b; Hubbard,
1963; Hill, 1964 a) hold that most of the educational outcomes measured
by the essay tests can be measured by good objective tests. Ebel (1953)
and Anastasi (1966) hold that the impression that the objective tests can
measure only trivial educational objectives is probably due to the fact
that such objective tests can easily be prepared by unskilled writers.

However, some studies have tried to find out empirically to what
extent the material covered by the essay tests can be made objective.
Sims (quoted in Woods, 1953) showed that out of 243 essay questions
in elementary education examinations, 103 were simple recall, 97 short
answer and only 43 discussion-type questions. In secondary education
examinations, out of 215 questions, 54 were simple recall, 63 short answer
and 98 discussion-type questions. He concluded from his study that only
"30.5 per cent of the questions (were) in reality subjective questions and
better than two-thirds of the grand total studied could have been
converted to objective type tests."

Orleans and Sealey (quoted in Woods, 1953), while not making a
statistical study of the problem, did call attention to it by giving an essay
test and then converting it to an objective-type test to show the thesis of
objectivity in essay questions.

Woods (1953) found that in one hundred separate essay tests studied,
there were a total of 1,456 answer items expected, as shown by the teachers'
suggested answers. Of that number, 1,442 were capable of conversion to
objective-type tests. The study suggests that if the essay-type questions
which can be converted to the objective-type tests are handled thus, there
would be much saving of teacher-pupil time, and students would be
better able to understand the fairness of the grading. The study also
suggests that teachers use the essay tests where objective tests would have
done the job better, and teachers frequently do not accomplish the
expressed goal of the essay tests of measuring pupils' thinking,
organisation and expression, due to improperly worded questions.

Moore (1954), giving a concrete example, has shown how one essay
question (for graduate medicine) can be covered by seventeen objective
items. Assuming that this example is average, he concluded that a
five-question essay examination may be covered in eighty-five objective
items. The usual objective examination of two- to three-hour duration
contains two to three hundred items, which could correspond to twelve to
eighteen essay questions. Thus, the objective examinations can cover a
broader spectrum of knowledge than the essay examination in the same time
period.
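
Moore's arithmetic can be checked directly. The sketch below uses only the figures quoted above and, like Moore, assumes that seventeen items per essay question is an average. The function names are illustrative.

```python
ITEMS_PER_ESSAY_QUESTION = 17   # Moore's single worked example, taken as average

def items_for_essay_paper(n_questions: int) -> int:
    """Objective items needed to cover an essay paper of n_questions."""
    return n_questions * ITEMS_PER_ESSAY_QUESTION

def essay_questions_covered(n_items: int) -> int:
    """Essay questions (rounded) that n_items objective items correspond to."""
    return round(n_items / ITEMS_PER_ESSAY_QUESTION)

print(items_for_essay_paper(5))
print(essay_questions_covered(200), essay_questions_covered(300))
```

The whole comparison rests on that one ratio, so it is only as sound as the assumption that Moore's example is typical.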
II Reliability and Validity of the Essay Tests

Several investigators have explored the reliability and validity of
essay-type tests.

Wiseman (1956) had 30-minute compositions written by a group of
173 students and graded by four examiners. This exercise was repeated
with another title four months later, and the essays were graded by the
same four examiners. Correlation between the aggregate marks on the
first and second essays was .89; corrected for the range of ability in the
sample, it was calculated to be .92. This means that the total reliability
of the essay tests is high for the pooled marks of four readers.

Wiseman (1956), in a validity study on a different sample (N = 117),
correlated entry test scores, including an essay, with the following five
criteria:

1. Total School Certificate results
2. School Certificate English Language
3. School Certificate English Literature
4. Teachers' over-all estimate
5. Teachers' estimate of written English

Multiple r's calculated for the whole battery, and for the battery
without the essay mark included, revealed that the addition of the English
essay increased the size of the coefficient in the case of all the five
criteria. However, it may be added that the design of this experiment is
open to criticism on the following grounds.

Selection of the sample was biased in favour of students regular in
their studies. It is also probable that school certificate standards differed
significantly from school to school, because all the schools selected for
the study were not of the same quality. Hence the inference based on the
correlations with school certificate may be misleading. Thus, the findings
of this study may only be considered tentative, pending corroboration
by further evidence.


Penfold (1956) explored variability in marking essay tests.
Five-minute essays written by 16,000 candidates were marked by impression by
sixteen examiners, each examiner marking one thousand scripts. After
an interval of time (the exact period is not reported) 165 scripts selected
at random from the whole batch were all re-marked by fifteen of the same
examiners, the sixteenth not being available. The results indicated that
there was significantly high variation between different examiners'
standards of marking and also between the marks of the same examiner
on the two different occasions of marking.

From the above results Penfold anticipated that in all probability the
best chance of improving consistency would be to devise an analytic
system of marking longer essays. He, therefore, studied the analytic
marking of 20-minute essays. Sixteen examiners were involved in marking,
including the chief examiner. At a standardising meeting the examiners
discussed the details of marking. To be quite sure of the conformity
in marking, each examiner marked a photostat set of 25 scripts judged
by the chief examiner to cover the whole mark range. Finally, after a
prolonged period of marking to this scheme, the extent to which the errors
had been reduced was checked by an analysis of marking of twenty more
photostatted essays. The results suggested significantly high variations in
marking between the examiners in spite of the extensive training and
experience.

It may be observed that no examining body can afford to give as
much intensive training to its examiners as was given in this experiment.
This is further evidence that variation between the markings
of examiners is an inherent and incorrigible weakness of the essay-type
tests.

Pidgeon and Yates (1957) found test-retest reliability for the essay
tests to be .77 if the same examiner re-marked and .72 if a different
examiner re-marked. In a follow-up study of 473 students who had
completed their secondary school course, they found that the old-type (i.e.
essay-type) English examination provides a significantly less satisfactory
forecast of subsequent success in secondary schools than is obtained
from the results of an objective test.

Hudson (1960) found that (1) neither the Cambridge F.R.S.'s nor
the Cambridge D.Sc.'s had better degree classes than those of their
respective control groups, and that (2) although Cambridge F.R.S.'s
were roughly three times as numerous as those from Oxford, they were
far less likely to have a first class degree. He concluded that if our
result is accepted as approximately accurate, it would appear that class
in Tripos bears no direct relation to subsequent eminence in scientific
research, as assessed by either of our two criteria.

Valin (1961) found total reliability of the essay tests, by correlating
marks on two tests, to be .12 for Arabic Philosophy and .52 for
Mathematics. Valin has not reported testing time. As such it is not clear
whether the very low reliability of Arabic tests was due to short lengths
of the tests or due to subjectivity in the essay-type tests.

French (1961) got 300 short essays, written as home work by college
freshmen, marked independently by ten English teachers, nine social
scientists, eight natural scientists, ten writers or editors, nine lawyers and
seven business executives on nominal compensation. The examiners were
asked to use their own judgement in marking and sort the papers into
nine piles in order of merit. The average correlation between all the
examiners was .31. Out of the 300 essays, 101 received all nine different
grades, and no paper received less than five different grades. The average
correlation between English teachers was .41. The higher correlation among
English teachers was attributed to their less erroneous markings. He
concluded that nothing at all is being measured very well by the essays.

Many teachers feel that reader reliability may be improved if essays
are marked on generally accepted criteria. Fostvedt (1965) explored
the criteria for evaluating high school English composition. Using
generally accepted criteria, thirty experts gave numerical ratings to
twenty themes. He concluded that perhaps the only conclusion justified
by the study was that, although teachers of English composition
might feel that criteria are important in evaluating themes, there was no
evidence of consistency in the employment of such criteria.

Lehtovara (1966), in a study on a sample of 1330 students, concluded
that the validity of the essay examination could be improved by the
modern test theoretic methods. The reliability of the grading of test
papers was found to be .85. This is one of the very few western studies
where reader reliability of essay examinations was found to be above .8.
It may be recalled that the grading reliability of .85, which looks very
high, is far below the reader reliability for the objective-type tests. For
an objective test scored carefully with an agreed key, the reader reliability
is 1.00. Even for careless scoring, the reader reliability of the objective
tests is far higher than .85. Ebel (1965, p. 312) remarks: "Even if over
half of the answer sheets in a set of thirty-five were scored wrongly,
with errors like those shown below, the coefficient of scoring reliability
would still be closer to .99 than to .98."

Size of Scoring Error     4    3    2    1    0    Total
Number of Errors          2    2    5   11   15     35

It may be stated that a test of reliability of .85 has to be lengthened
approximately nine times to reach the reliability of .98.

Myers (1966) had essays of over 80,000 students double marked.
In a period of five days, 145 readers gave each of the essays two
independent readings on a four-point scale. The average reliability of
the judgements, calculated by the analysis of variance method among all
readers and across all papers, ranged from .26 to .49 over a five-day
period.


One of the arguments constantly brought up in favour of the use
of essay examinations is that they purport to measure creativity.
However, Page (1966), in a study of a sample of essays written by
[...] students and judged by independent examiners, found the
measurement of creativity on essay tests to be the least reliable attribute
rated by human judgement, and mechanics the most reliable.
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Coffman (1966) mvaslisated the talidily of the essay ijpe tests to 
predict writing ability The students who were the subjects of the re- 
search were in classes XI and XU in 24 American secondary schools 
Each student toot six objective and two semi objective tests and wrote 
five essajs of different lengths Each student wrote on five different 
topics and each essay was marked by five trained examiners The sum 
of these twenty-five marks was the student s criterion score The reader 
reliability of one essaj read once, that is, the correlation between two 
independent readings of one essay ranged from 1 1 to 34, the average 
being 26 which is a very low figure The validity coefficient of essay 
tests ranged from 33 to 55 He concluded that “m order to obtain 
validity coefficients comparable to those obtained from an one hour 
objective test using only essay tests, it would be necessary to assign at 
least two topics to each student and have each read by five different 
readers, or to assign three topics and have each read by three different 
readers " Vernon (1940) makes exactly the same recommendation for 
attaining a high reliability for the essay-type test, that is, to have a 
minimum of three essays from each pupil, each marked by three mar- 
kers A further analysis of the data of Coffman et al (quoted in Hill, 
1967 c) showed the following major findings 

1. "'The reliability of essay scores is primarily a function of the
number of different essays and the number of different readings included.'
These factors are more important than the length of the essays, and they
have more value than analytical reading."

2. "Validity is lost when optional topics are given." In a situation
where choice is allowed, "a student's rating might depend more on
which topic he chose than on how well he wrote."

3. "'When objective questions specifically designed to measure
writing skill are evaluated against a reliable criterion of writing skill,
they prove to be highly valid.'"

4. "'The most efficient predictor of a reliable direct measure of
writing ability is one which includes essay questions in combination
with objective questions.' A combination of two objective tests and one
essay, requiring a total testing time of one hour, correlated most highly
with the criterion. However, the validity added by a twenty-minute
essay is small, and it requires that the essay be marked by at least three
examiners. 'It is doubtful that the slight increase in validity alone can
justify the increased cost.'"

Bracht and Hopkins (1968) found reliability coefficients of two
20-minute objective tests, calculated by K.R.-20, to be .40 and .50.
The reliability coefficients of three essay tests of the same time length,
calculated by the analysis of variance method (that is, content reliability),
were found to be .58, .26 and .69. As already observed, the content
reliability of the objective test is comparable to the total reliability of the
essay-type test. If reader reliability were taken into account, the
essay test reliabilities reported would be reduced considerably, while
the objective test reliabilities would remain unchanged. Thus the essay
and objective test reliabilities as reported here should not be considered
as directly comparable. The other major findings of this study are as
follows:

Cumulative grade point average correlated very significantly with
performance on both types of tests, but the correlation with the objective
tests was significantly higher. Writing ability was significantly related
to both the essay and objective tests, the difference between the
correlations being not significant. The findings of this study do not support
the common supposition that the objective and essay tests measure
different abilities. In addition, there is clear evidence that the objective test
is a more reliable measure of academic achievement.

Klein and Hart (1968) investigated the chance and systematic factors
affecting essay grades by getting the scripts of over 1500 students of
law marked by seventeen professors. They found that students tended
to be consistent in their performance in different essay questions used
in the law school. The agreement in grading among the professors
was .76.


III. Conclusions Suggested by the Foreign Studies

1. Most of the material covered by essay tests may be converted 
to objective tests. Much more content area may be covered in testing if 
essay tests are converted to objective tests. 

2. For short essays, reliability is very low. 

3. There is significantly high variation between essay markings 
of different examiners, or of the same examiner on two different occasions. 
Intensive training for standardising markings seemed to have little effect 
in reducing the variations in markings of different examiners. 

4. Creativity, which is supposed to be the major quality assessed
by the essay test, is least reliably graded.

5. In essay examinations, in order to get a validity coefficient
comparable with that of a one-hour objective test, it is necessary that at
least two topics each marked by five independent examiners, or three
topics each marked by three independent examiners, be obtained.

6. The essay and objective tests do not necessarily measure
different abilities. On the other hand, the evidence shows that objective
tests are significantly more reliable measures of academic achievement.

7. Validity is lost when optional questions are provided, and a
student's score may depend more on what topic he wrote than on how
well he wrote.

8. The predictive validity of essay tests in forecasting success in
succeeding examinations is significantly less than that of objective tests.

9. Essay examination results are not valid measures for predicting
eminence in scientific research.

Examination Research in India 

Many areas in examinations have been covered by research studies
in India. These studies can be broadly classified under two sub-heads:
functional aspects of examinations and efficiency aspects of examinations.

I. FUNCTIONAL ASPECTS

Studies on the functional aspects of examinations have produced
what is essentially a survey of the statistics about different examinations
in India. A Mysore University survey (1965) showed that pass
percentages for Pre-University examinations vary from university to
university in India. In 1965 the pass percentage for the Pre-University
examination of the Mysore University was 30, whereas it was 55 for
the Bombay University. The pass percentage for the same examination,
even for the same university, varies from year to year. The pass percentage
for the Intermediate examination varied from 26 in 1935 to 48 in 1945
in the Mysore University. Mukerjee (1954), in a study done in West
Bengal, found that the pass percentage for the Matriculation examination
varied little from 1912 to 1945. He inferred that pass percentages
remain almost static from year to year. Dave and Patel (1966), DEPSE
(1962 a and undated a, b) and the S.S.C. Board, Maharashtra (1963 a)
found that the pass percentages of regular students are markedly higher
than those of private candidates. The S.S.C. Board, Maharashtra
(1960) found that there was better pass performance among urban
students than among their rural colleagues. They found that at the age
of 19-20 years girl students do better than boy students, while at the
age of 16-19 years boy students do better than girl students. It was
also noticed that the mean percentage of successful candidates rapidly
declines with the advance of age above 16 years. This tendency was
found both for boys and girls. The regional languages were found to be
more efficient as the medium of instruction than the English language
in enabling the students to score higher marks in Science, History and
Geography.

Chitrakara (1961), DEPSE (undated a), Deshmukh (undated),
Bhanot (1961), Deshmukh and Kamat (undated), the M.S. University,
Baroda (1964), the Mysore University (1965), Pillai (1965) and the
S.S.C. Board, Maharashtra (1960, 1963 a) studied the wastage and
stagnation among students. The wastage and stagnation were found for
each class to be between 40% and 70%. The general causes of the wastage
and stagnation, as inferred by these studies, are poverty, lack of proper
atmosphere at home and school, and want of equipment and qualified staff
in schools. Lele et al. (1966) analysed the failure rates of the Preparatory
Commerce examination of the M.S. University, Baroda for the years
from 1950-51 to 1964-65. (The Preparatory Commerce examination
follows a one-year course to which students are admitted after passing the
S.S.C. Examination.) They found that there was a significant upward trend
in the failure rate in the Preparatory Commerce examination. Lele et al.
inferred that, assuming that the quality of the S.S.C. examination, as
evinced by the S.S.C. examination marks, remains the same, the
Preparatory Commerce examinations have become more difficult.

Mitra (1958) and Misra (1968 a) factor analysed examination
marks of students to find out common factors underlying different
academic subjects. Mitra's study does not provide any conclusive
evidence, probably because the number of cases in the sample was very
small. Misra (1968 a) found three factors in the Matriculation
examination of the Gauhati University. The first factor, which was dominant,
was named "verbal". The second and third factors were almost of equal
order, and were named "problem solving" and "memorisation".
Contrary to his expectation, Misra did not find a numerical factor.
He stated as the probable cause that the condition of having at least
two variates for each ability was not satisfied in his experiment in the
case of the numerical factor. Lele et al. (1964) factor analysed the S.S.C.
and P.Sc. examination marks, the two being successive examinations,
for the years 1957-58, 1958-59 and 1959-60. They found two common
factors, namely, "verbal" and "numerical".

Roy (1969) found that the failure rate in Mathematics in the H.S.L.C.
examination is on the increase. It shot up from 45.7% in 1966 to
76.9% in 1968. He attributes the failures to the poor quality of
students appearing in the H.S.L.C. Mathematics examination.

The studies on the functional aspects of examinations provide
useful information, but they are very few in number. On some points the
findings are contradictory, which suggests the need for further
investigation. Moreover, the value of such studies is limited as regards
assessment of the effectiveness of examinations as a tool of
[...] valid, precise and economical.

Studies on the efficiency aspects of examinations may be classified
under the following sub-heads: variation in markings, marking errors,
reliability of marks and validity of marks.


1. VARIATION IN MARKING
(a) Variation for the same set of scripts

There are several studies on the variability of marking the same set
of scripts by different examiners or by the same examiners on different
occasions.

Mukerjee (1961) had two essays, one in Assamese and the other in
English, each marked independently by five competent examiners. The
five examiners for the Assamese scripts awarded marks ranging from
14 to 29. The five examiners for the English scripts awarded marks
ranging from 6 to 22. The two scripts were marked by the same examiners
after a week. The highest difference in marks given by the same examiner
to the same scripts on the two occasions was 12 (i.e. 31-19) for Assamese,
and 11 (i.e. 22-11) for English. The mean marks awarded to the Assamese
scripts varied from 26.4 to 18.6 and for the English script from 15.6
to 8.4 on the two occasions. As is evident from the study, not only did
the examiners differ from each other, they even differed from
themselves on the two occasions, i.e. after a lapse of seven days.

George (1964) studied the difference in examiners' markings in the
Pre-University Test of the Kerala University by selecting a random
sample of 40 scripts each of English I, English II, Mathematics, Physics,
Chemistry, Botany and Zoology. These scripts were marked independently
by two separate external examiners of the University. Except for English I
and Zoology, in all other subjects the differences between the markings
of the external examiners were found significant at the 1% level. In
Zoology the difference was found significant at the 5% level.

George (1964) in another experiment had a large number of scripts
marked by two independent University examiners in the above-noted
subjects of the P.U. students of the Kerala University. He found that
the percentage of disagreement in classifying students in different
categories, that is I, II, III and Fail, ranged from 22.6 in Mathematics
to 52.2 in Botany. In Physics one student who was failed by one examiner
was given a first class by the other examiner. In Botany there were two
such students. In Mathematics one student who was given a fail and
twelve students who were given a third class by one examiner were given
a first class by the other examiner.

It may be observed that the scripts used in both the experiments
of George (1964) seem to have contained crosses and ticks given by the
original examiners. The crosses and ticks might have invited the
attention of the other examiners to the points marked by the original examiner.
Thus, the obtained disagreement in the markings should be considered 
as the minimum obtainable in a real examination situation. 

Taylor, Tluanga and Misra (1966 b) had answer scripts of 100 
examinees of B.A. Part I English examination of the Gauhati University 
cyclostyled and examined independently by nineteen examiners. The 
mean marks ranged from 19.2 to 35.0 (out of 100 marks) and pass percentages
ranged from 7 to 62 for the nineteen examiners. In some cases the 
disagreement in awarding marks was extreme. For example, it was 
found that a student who got a 3rd rank from one examiner was given a 
98th rank by another. These scripts were originally marked by the 
University examiners. Out of the 100 students, the number of original 
failures who passed and the original passes who failed in the second 
marking ranged from 20 to 37 for the nineteen examiners. It may be 
observed that in the experiment the copies of scripts were not reproduced
in the students' own handwriting. They were typed. Thus, disagreement
in marking due to the quality of handwriting was eliminated.
As such, the disagreement as reported in the study should be considered
as the minimum expected in a real situation.

Harper (1967 a) had facsimile copies (by Xerox process) of 10 scripts
of the History examination of class X of a Board of Secondary Education
marked by 90 experienced examiners of the same Board. It was found that
the only student considered fit for the award of a star (i.e. above 75%
marks) was also failed by several examiners. The script which was given
first or second rank by almost all the examiners got from a high first
class to a bare pass mark. The variation in pass percentages ranged from
10 to 80. This is the only study in India where variation in markings was
explored in an almost real examination situation, i.e. the scripts were
reproduced in the students' own handwriting, these were sent to examiners
during the time when they usually examine the Board scripts, and the
examiners were paid remuneration at the Board's rate. Thus, the findings
of this study may be considered as representative of variability in
markings in real examinations.


Harper et al. (1967) had scripts in Hindi, History, Geometry and
Biology of the Class X final examination of a Secondary Board re-marked,
some by the same examiner and some by another examiner. The highest
[...] was found in Geometry (supposedly an "objective" paper). The same
examiner, re-marking his own scripts, differed from himself by [...].
The highest mean difference was found in Biology (another [...] paper),
a mean difference of 9.8 for [...] re-marking of their own scripts. They
found that the variability in marking remains almost of the same magnitude
whether an examiner re-marks scripts previously marked by himself or
marked by another examiner.

Harper et al. (1967) analysed the variability in marking in a
different way also. In the above-noted experiment the markings of the two
examiners were equally erroneous. They, therefore, selected one
examiner who was considered by the Board to be highly experienced in
marking, that is, a "standard" examiner. The scripts marked by the
standard examiner were re-marked by other experienced examiners
of the Board. The mean variations in the two markings were found to
range from 14% to 32% of marks in History, 12% to 24% in Hindi,
12% to 16% in Biology, and 18% to 28% in Geometry.

Many teachers feel that marking is objective in Science and
Mathematics papers. The experiments of George (1964) and Harper et al.
(1967) show that in essay-type tests the subjectivity of markings in the
Science and Mathematics papers is no less, if not more, than that in the
Arts papers.

Taylor (1962 t) had scripts of 45 students in English (two papers),
Economics, History, Logic and Mathematics independently marked
by two examiners. The highest difference between means was found in
Logic, where the mean mark of one examiner was 55.8, and of the other
examiner was 46.0. The lowest difference between means was found in
History, where the mean marks varied from 40.4 to 39.5. The highest
difference in the standard deviations was in Logic, where the s.d.'s varied
from 18.0 to 10.8. He concluded that "the experiment makes it clear
that an examination mark has neither the sanctity nor the precision
which is usually attached to it". He recommended the use of scaling to
minimise the effects of errors in marking. Some people suggest that,
to improve the reliability in marking essay-type tests, a smaller number of
scripts should be given to examiners. This experiment makes it clear
that even for a very few scripts the differences in means and s.d.'s in the
markings of two examiners are alarmingly large.


(b) Variations for equivalent sets of scripts

Taylor did several experiments on the variation in markings for
statistically equivalent sets of scripts. He observed that "By allotting
roll numbers at random, or otherwise mixing the candidates, one can
ensure that the sub-groups are statistically equivalent".

He used this method of making sets of scripts statistically equivalent
in the following three studies, so that any differences in the mark
distributions can be attributed to the examiners and not to the candidates.
Taylor (1962 b) found that in an English paper of college students
one examiner produced a distribution which Taylor describes as
51 (5, 3) (where 51 is the mean, 5 the difference of the upper quartile from
the mean, and 3 the difference of the lower quartile from the mean)
and the other one, for the equivalent set of scripts, produced the
distribution 31 (3, 4). With one exception, all the marks given by the second
examiner were lower than any marks given by the first examiner.

Taylor (1962 b) had a first-year college paper in Biology, involving
210 candidates, marked by two pairs of examiners, A and B taking
110 scripts, C and D taking the remaining 100 scripts. The two sets of
scripts had been made statistically equivalent by allotting random roll
numbers. The marks of each pair of examiners were averaged. The
median mark was 66.5 for A and B, and 42.0 for C and D. The true
means were respectively 64.9 and 40.4. Taylor concludes:

The difference in the average marks (whether one used the
median or the mean) is thus 24-1/2. With A and B, 88 per cent
of the candidates got 55 marks or more; with C and D, on the
other hand, 95 per cent of the candidates got less than 55 marks.
The difference is even more remarkable in that each total mark
is found by combining the separate marks of two examiners, a
procedure which would tend in general to smooth out differences in
the standards of marking. The examiners were very unwilling to
[...] the possibility of such large differences until the analysis
was put before them.

Taylor (1963 a) found the pass percentages of the equivalent scripts
marked by different examiners varying from [...]% to [...]% in English
and from 27% to [...] in [...] Matriculation examination [...].

2. MISCLASSIFICATION OF STUDENTS DUE TO MARKING

[...] classification in divisions is reliable though the marks may not
be so. Taylor (1962 c) explored this point. He studied a hypothetical
group of one thousand students with an average mark of [...] and a
standard deviation of 15. He assumed that a pass was secured by a mark
of 30, second class by a mark of 48, and a first class by a mark of 60.
He further assumed the standard error of marking of [...]. He found that
about 25% of the students would be wrongly classified in the I, II, III
and Fail categories in the above-noted case. He also found in this case
that 9% of the students who really deserve a pass fail on a paper by
accident. [...] assumption, he calculated the probability of passing a
student in a ten-paper examination when the student has the same ability
in all the papers and would pass the whole examination on the basis of his
true mark. The probability of passing simultaneously in a ten-paper
examination was (for such a student) [...] and the other may [...]

[...] each of Hindi, History [...] if marked again [...] found the
expectancy [...] as follows: in History, 9% [...] I division, 53% for
II division, 66% for III division [...]; [...] for II division, 64% for
III division and 63% for fail; in Biology, 39% for I division, 62% for
II division, 64% [...] 65% [...]; in Geometry, 84% for distinction, [...]
61% for III division and 90% [...].

George (1964) got 3,416 scripts of English, Mathematics, Physics,
Chemistry, Botany and Zoology of [...] of the Kerala University [...]
of nearly 30% of the students [...] system of evaluation produces a
considerable degree of error.

3. RELIABILITY

[...] under three heads [...]

(a) Content reliability

There are [...] certain papers of Class X [...] Education, West Bengal
[...] do not make any differentiation between various types of
reliability [...] reported by [...] for papers of the
class X examination of the Board of Secondary Education, West Bengal.
In the first study (1961) the content reliability was found to be
.63 for Mathematics. In the second study (1962) it was found to be
.7 for English I and II each, .6 for English III, and .85 for all the three
papers taken together. In the third study (1966) it was found to be .85
for Sanskrit, .80 for Hindi I, .68 for Hindi II, .84 for Hindi I and II
taken together, .74 for Bengali I, .65 for Bengali II, and .70 for Bengali
I and II taken together. In the fourth study (1967) it was found to be
.80 for Physics I and II each, .83 for Physics I and II taken together,
.75 for Chemistry I and II each, and .86 for Chemistry I and II taken
together. In the above studies each paper, except English II which was
of two hours, was of three hours.

It may be observed that in India the above-noted studies done
by Gayen et al. are the pioneering studies analysing a Board's results
by modern psychometric methods. The strengths of these studies are
many. There is no study in India on public examinations where the
sample is so representative of the population as in these studies. These
are the only studies in India which have tried to analyse every aspect
of examinations using a wide variety of statistical tools of analysis.
Of course, from a pioneering study, one cannot expect to get answers
to each and every problem in the field. It is also natural that even such
elaborate studies may have some weak points. Without in any way
detracting from the overall value of the studies, it appears desirable
to invite attention to the following points.

Their definition of difficulty index is novel, i.e. different from the
generally accepted definition of difficulty index. Unless one knows the
special meaning given to this term by them (and it is very unlikely that
a technical man would consult their definition, due to his confidence
that he knows the term), he will certainly draw wrong conclusions
from their figures of difficulty index. Commenting on the generally
accepted [...] of difficulty, Ebel (1965, p. 359) rightly states:
"[...] between what is ordinarily meant by difficulty and [...]
measure of it is illogical, but the convention has been [...] to attempt
a change now would only contribute to confusion."

[...] the concept of difficulty index, Gayen et al. state that [...]
upon the ability of the group who answered the item. But in the
calculation of the difficulty index of an item they seem to have ignored
the ability level of the students answering the particular item.

[...] Mathematics paper in two halves they almost ignored the
matching [...] mainly guided by the consideration of having [...]
other half which could best correspond
with the marks awarded in the first half. This approach may lead to
spurious reliability. In the application of K.R.-20, they have treated
the total of marks awarded on the several sub-groups of a question as
the item score for that question. This approach would also lead to
spurious reliability. However, they have amply made it clear, in a true
researcher's spirit, that these were the only approaches which they
thought they could possibly adopt, though they knew that their data
did not satisfy all the conditions needed for such approaches.
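
For readers unfamiliar with K.R.-20, the sketch below shows the standard Kuder-Richardson formula 20 for items scored right (1) or wrong (0). It is an illustration only: the function name and the small data set are hypothetical and are not taken from any of the studies discussed. The formula presupposes dichotomous item scores, which is why applying it to question totals built from several sub-parts departs from its conditions.

```python
def kr20(item_scores):
    """Kuder-Richardson formula 20.

    item_scores: one row per examinee, each row a list of 0/1 item scores.
    Uses the population variance of the total scores.
    """
    n = len(item_scores)
    k = len(item_scores[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two examinees and two items")
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n
        sum_pq += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - sum_pq / var_total)


if __name__ == "__main__":
    # Hypothetical right/wrong scores of five examinees on four items.
    scores = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(f"KR-20 = {kr20(scores):.2f}")
```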

Despite these minor criticisms, it may be emphasised that Gayen
et al. have set the tone for analysis and interpretation of many problems
of essay-type examinations, and their endeavour deserves all
commendation. To quote Shakespeare:

Frail creatures are we all. To be the best 
Is but the fewest faults to have 

Misra (1968 a) investigated content reliabilities of English I, II
and III, History, Geography and Mathematics of the Matriculation
examination, 1963, of the Gauhati University by factor analysing the
marks obtained by 1,241 students selected from the examination on the
basis of random sampling. He found the content reliability (obtained h²)
to be .76 for English I, .63 for English II, .63 for English III, .60 for
History, .49 for Geography, and .50 for Mathematics.

Misra (1968 b) in another study found the content reliability of
English I (of three-hour duration) of the Pre-University examination,
1966, of the Gauhati University, by analysis of variance and split-half
methods, to be around .8.

Harper et al. (1967) investigated the content reliability of Hindi,
History, Biology and Mathematics of the class X examination of a secondary
board of education. They found the content reliabilities for various
examiners in Hindi ranging from .45 to .77, for History ranging from
.21 to .76, for Biology ranging from .41 to .64, and for Mathematics
from .68 to .90. They attribute the variability in the content reliability
of the same question paper for various examiners to the fact
that different examiners may emphasise different aspects of the answer,
e.g. style, factual accuracy, organisation, creative thinking, etc. Since
the proportion of each of these in the content of answer scripts may
differ, the emphasis on different aspects is likely to produce different
content reliabilities for different examiners.

It may be noted that Gayen et al. (1961, 1962 and 1966), Misra
(1968 a and 1968 b) and Harper et al. (1967) found approximately the
same reliability figures for different papers, except Mathematics. For
the Mathematics paper Harper et al. (1967) report high reliabilities,
i.e. .68 to .90 (median .77), and Gayen et al. (1961) and Misra (1968 a)
report low reliabilities, i.e. .63 and .50 respectively. One of the probable
reasons for such a difference in the reliability figures of Mathematics may
be that Harper et al. (1967) calculated reliability for Geometry only,
whereas Gayen et al. (1961) and Misra (1968 a) calculated reliability
for the paper containing questions on Arithmetic, Algebra and Geometry.
It is possible that Arithmetic and Algebra tests may be less reliable than
a Geometry test. Thus, the inclusion of Algebra and Arithmetic in the
question papers studied by Gayen et al. (1961) and Misra (1968 a) might
have resulted in the low reliabilities obtained by them. However, this
point needs further exploration.


(b) Reader reliability

As defined earlier, the correlation between the marks of
two readers (examiners) is known as the reader reliability of an
examination.
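
The definition above can be sketched as a product-moment (Pearson) correlation between two examiners' marks on the same scripts. The function name and the marks below are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of marks."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 marks")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("marks of at least one examiner do not vary")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical marks awarded to the same five scripts by two examiners.
    first_examiner = [40, 55, 32, 61, 47]
    second_examiner = [35, 50, 30, 52, 49]
    print(f"reader reliability = {pearson_r(first_examiner, second_examiner):.2f}")
```

One property of this coefficient is worth keeping in mind: it is unchanged if one examiner marks every script a fixed number of marks lower than the other, so a high reader reliability does not by itself show that two examiners apply the same standard.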

Taylor, Tluanga and Misra (1966 b) found the reader reliability
of English I of B.A. Part I (of three hours' duration) of the Gauhati
University by getting a random selection of 100 scripts (cyclostyled
copies) independently marked by nineteen examiners. The average
correlation among the examiners was found to be .77 and the coefficient
of concordance (W) was around .75. These scripts were previously
marked in the University examination. The coefficient of correlation
between the University marks and the average of the nineteen examiners'
marks was found to be .67. The low correlation of the University marks
with the average of the nineteen examiners' marks seems to suggest
that marking in real examinations is more erroneous than in experimental
situations.
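
The coefficient of concordance (W) mentioned above is Kendall's measure of agreement among several judges' rankings. The following is a minimal sketch, assuming each examiner's marks have already been converted to ranks with no ties; the function names and the sample ranks are hypothetical.

```python
def kendalls_w(ranks):
    """Kendall's coefficient of concordance for untied ranks.

    ranks[i][j] is the rank (1..n) that examiner i gives to script j.
    """
    m = len(ranks)          # number of examiners
    n = len(ranks[0])       # number of scripts
    if m < 2 or n < 2:
        raise ValueError("need at least two examiners and two scripts")
    expected = list(range(1, n + 1))
    if any(sorted(row) != expected for row in ranks):
        raise ValueError("each row must be a permutation of 1..n (no ties)")
    rank_sums = [sum(row[j] for row in ranks) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((t - mean_sum) ** 2 for t in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def mean_rank_correlation(w, m):
    """Average Spearman rank correlation over all pairs of m examiners."""
    return (m * w - 1) / (m - 1)


if __name__ == "__main__":
    # Hypothetical ranks given by three examiners to four scripts.
    ranks = [
        [1, 2, 3, 4],
        [2, 1, 3, 4],
        [1, 3, 2, 4],
    ]
    w = kendalls_w(ranks)
    print(f"W = {w:.2f}, mean rank correlation = {mean_rank_correlation(w, 3):.2f}")
```

With nineteen examiners, a W of .75 corresponds to an average rank correlation of about .74 between pairs of examiners. The .77 quoted above is presumably an average of product-moment correlations, so the two figures need not coincide exactly.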

Harper (1966 b and 1967 a) found the reader reliability of a History
paper of class X of a Secondary Board of Education by getting facsimile
copies (by Xerox process) of ten scripts, fairly representative of all levels
of ability, marked by ninety experienced examiners of the same Board.
The average reader reliability was of the order of .83.

Harper et al. (1967) found the following figures of reader reliability
for different subjects of the class X examination of a Board of Secondary
Education:

Subject          Range of reliability figures     Median reliability

Hindi            .61-.94                          .79
History          .49-.93                          .77
Biology          .67-.91                          .80
Mathematics      .81-.99                          .96



Table showing reliability figures obtained in Indian studies (Contd.)

It may be observed here that reader reliability figures reported in
Indian studies compare well with such figures reported in foreign
countries. For example, as reported earlier, reader reliability was found to
be .31 by French (1961) and .85 by Lehtovaara (1966). This shows that
Indian examiners are not less efficient than their counterparts in most
advanced countries.


(c) Total reliability 

As defined earlier, the total reliability is the correlation between
the marks given by two parallel examiners on two parallel essay tests.
The present researcher is not aware of any study in India which strictly
satisfies the conditions for the determination of the total reliability of
the essay test. However, there are studies comparing the marks obtained
by a group of students on internal and external examinations, both of
which are supposed to be measuring the same content. These results
may be roughly taken as an estimate of the total reliability of essay-type
tests. It may be noted that the studies reported under this head
have not used the term "total reliability". They have been grouped by
the present researcher under this head as to him they seemed to satisfy
the conditions of the total reliability.

The S S C Board Maharashtra (I960) did a similar study by cortela 
ting marks of school test examination with class \ final examination 
marks Tor the purpose of sampling the examination area was divided 
m two regions A fairly random sample was drawn (torn each region 
The results obtained are reported below 

                                     I Sample             II Sample
Subject                              N     Reliability    N     Reliability

Language, Higher Level               192   [?]            [?]   .51
Language, Lower Level                379   .57            562   .47
Social Studies                       255   .46            405   .52
General Science                      374   [?]            [?]   .62
Elementary Mathematics               155   .19            310   [?]
English                              327   .69            432   .69
Science                              [?]   .57            263   .60
Classical Language                   224   .69            332   [?]
History                              [?]   .52            [?]   .52
Geography                            117   .40            240   [?]
Indian Administration and Civics     95    .33            136   [?]
Arithmetic                           [?]   .61            [?]   [?]
Algebra and Geometry                 145   .71            [?]   [?]
Aggregate Total                      353   [?]            [?]   .76
Gayen et al (1962) found the following correlations between the
school test examination marks and Secondary Board examination marks
of the same group of students in different school subjects.


Subject            N      Reliability

English I          507    .61
English II         636    .63
English III        634    .44
English Total      494    .73
Bengali I          548    .39
Bengali II         [?]    .51
Bengali Total      523    .55
Mathematics        483    .60
History            636    .43
Geography          640    .35


Harper (1966 a) has statistically shown that the total reliability
of an essay-type test is equal to the product of its reader and content
reliabilities. Accordingly, Harper et al (1967) multiplied the figures
of the content reliability, calculated by the analysis of variance method,
with the figures of the reader reliability obtained for the same set of
scripts. The median total reliabilities for various subjects were found
as noted below.


Subject            Total Reliability

Hindi              .58
History            .35
Biology            .50
Mathematics        .72


Following Harper's approach, that is, by multiplying the figures
of the reader and content reliabilities, Misra (1968 b) found the total
reliability of English I of the P.U. Examination (of three-hour duration)
of the Gauhati University to be .64.
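
The product rule described above may be illustrated by a minimal sketch in Python. The function name and the sample figures are assumptions made only for illustration; they are not the reader and content reliabilities actually obtained by Harper or Misra.

```python
def total_reliability(reader_reliability: float, content_reliability: float) -> float:
    """Total reliability as the product of reader and content reliabilities.

    This follows the rule the text attributes to Harper (1966 a).
    Both inputs are expected to lie between 0 and 1.
    """
    for value in (reader_reliability, content_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities are expected to lie between 0 and 1")
    return reader_reliability * content_reliability


# Hypothetical figures: a reader reliability of .80 and a content
# reliability of .80 still give a total reliability of only about .64.
example = total_reliability(0.80, 0.80)
print(round(example, 2))
```

Because both factors are at most 1, the product can never exceed the smaller of the two, which is why a test with respectable reader reliability may still have a low total reliability.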

Empirical studies done by the S.S.C. Board, Maharashtra (1960),
[...] and Datta (1967), and theoretical analysis by Vernon (1940) and
Harper (1965), indicate that the total of marks of an examination is
more reliable than the marks on a single paper [...]

Following the leadership of T. L. Kelley, there has been a general
tradition that to be sufficiently reliable for discriminating between
individuals a test should have a reliability coefficient of at least .94.
Some have been more liberal in this regard, allowing a minimum
of .90, while others have been more demanding, with a requirement
of a minimum of .96. While these qualifications mentioned
regarding reliability and validity need to be made, the fact remains
that in practice we expect reliability coefficients to be in the upper
brackets of r values, usually .80 to .98.

It has already been explained that for practical decisions one should
consider the total reliability of essay tests. In view of the remarks made
by Guilford, it is evident that none of the total reliability figures reported
in Indian studies reach even the minimum limit of reliability usually
acceptable in practice.


4. VALIDITY

As defined earlier, a test is valid if it measures well what it is
supposed to measure. Statistically, validity is the correlation of a test
with a criterion. Studies on the validity of marks are generally confined
to the predictive validity of marks for achievement in some other
examination.

Taylor (1962 b) correlated the marks of 120 college students in
Chemistry, Physics, Chemistry Practical and Physics Practical. The
coefficient of correlation (r) and the standard error of the coefficients
(s) were found as reported below.

Physics Theory vs Physics Practical        r = 0.14    s = 0.0[?]

Chemistry Practical vs Physics Practical   r = [?]     s = [?]

Physics Theory vs Chemistry Theory         r = 0.30    s = 0.0[?]

The last correlation was affected by the presence of two very poor
students. If they were excluded, the value of r falls to .20, which is
barely significant. He concluded that if marks are valid measures of
ability, there is no apparent connection between practical and theoretical
abilities in Physics, none between the two practical abilities, and none
between the two theoretical abilities. As such an assumption is
unacceptable, we must conclude that marks are not valid measures at all.
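
How a standard error of this kind bears on the significance of r may be sketched in Python as follows. The classical large-sample formula (1 - r^2) / sqrt(N) is used here as an assumption; the text does not say how Taylor computed s. The function names are hypothetical.

```python
import math


def standard_error_of_r(r: float, n: int) -> float:
    """Classical large-sample standard error of a correlation coefficient.

    Assumed formula: (1 - r**2) / sqrt(n). The source does not state
    which formula Taylor used.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return (1.0 - r * r) / math.sqrt(n)


def ratio_to_standard_error(r: float, n: int) -> float:
    """How many standard errors the coefficient lies away from zero."""
    return r / standard_error_of_r(r, n)


# Figures quoted in the text: N = 120, with r = 0.14, 0.30 and 0.20.
for r in (0.14, 0.30, 0.20):
    print(r, round(standard_error_of_r(r, 120), 3),
          round(ratio_to_standard_error(r, 120), 2))
```

Under this assumed formula a coefficient of .20 on 120 students lies a little over two standard errors from zero, which agrees with the description "barely significant".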

Patel (1962), in an analysis of B.Ed. examination marks, found the
correlation between the practical annual examination and the year's
work in practice teaching to be .62. The correlation between total marks
in Part I and Part II examinations was found to be .63. Assuming that the
same ability underlies the various papers of the examination, his study
suggests moderate validity of examination marks.
Kamat and Deshmukh (1961) inter-correlated the aggregate marks
of a group of students obtained in different examinations. They found
the following correlations:

                 Inter Arts    Inter Science    B.A.      B.Sc.

S.S.C.           .60           .61              .49       .21
                 N=296         N=1143           N=192     N=370

Inter Arts                                      .63
                                                N=182

Inter Science                                             .42
                                                          N=347


These correlations seem to suggest a moderate validity of examina- 
tion marks. 


Misra G.S. (1964 a) in a similar study found the following r's.

                H.S.      Intermediate   I.A.      I.Sc.     I.Sc.Ag.

Intermediate    .71
                N=461

I.A.            .73
                N=125

I.Sc.           .84
                N=125

I.Sc.Ag.        .72
                N=125

Graduation      .83       .87
                N=461     N=461

B.A.            .69                      .69
                N=125                    N=125

B.Sc.           .85                                .87
                N=125                              N=125

B.Sc.Ag.        .74                                          [?]
                N=125                                        N=125


He concluded that the results of High School and Intermediate
examinations have a high predictive value for achievement in University.

Taylor, Tluanga and Misra (1966 c) correlated the marks of 347
students in Matriculation and P.U. examinations. They estimated the
correlation between the two examinations, making allowance for the
students who, because of failing Matriculation, did not appear for the
P.U. examination. Their value of r gives an estimate of the correlation
between the two examinations if all the students who appeared for the
Matriculation examination were allowed to sit for the P.U. examination.
The value of r was found to be .72. They also estimated, on the basis of
the obtained correlation coefficient, that out of 15,470 students who
had failed in the Matriculation examination, 1,688 would have passed
if given a chance to appear for the P.U. examination.

Lele et al (1963) found the r between S.S.C. (English) and P.Sc.
(English) marks in M.S. University, Baroda to be .77.

Harper (1963 b) explored the problem of whether essay or
objective tests are better predictors of performance on an essay-type
examination. He had an objective examination of knowledge of English
poetry as the final "Test Examination" in several Intermediate Colleges.
The results were correlated with the U.P. Secondary Board's examination
in the same paper. The validity coefficient ranged from -.25 to
.82. (The first coefficient was not significantly different from zero.)
In other words, the same objective examination had different validities
for predicting the same external essay examination score in different
institutions. The difference between the validity coefficients may have
been due to inherent differences between the institutions, or it may
have resulted from the extreme unreliability of the essay-type external
examination.


Harper (1963 b), in another study, administered two objective tests
of the knowledge of English usage to 72 Intermediate students. The
criteria were two terminal examinations of the essay type. The
correlation between the two essay tests, each of 2 hours 30 minutes'
duration, was .74. The correlation between the two objective tests, each
of 45 minutes' duration, was .75. Corrected for the length of the tests,
the reliability of the objective tests was .91, which is significantly
higher than the reliability of the terminal examination of the essay type.
The validity coefficient of the objective tests, corrected for attenuation,
for predicting performance on the English essay examination was above
+.90. Harper concluded that "the objective test is a better measure per
unit time in predicting actual essay writing marks, than an essay
examination of the same length." Gayen et al (1961, 1962 and 1966) have
also reported validity of some papers of the class X examination of the
Board of Secondary Education, West Bengal. To the present researcher
their validity figures seemed to fit the conditions of reliability and have,
therefore, been discussed under the head "Reliability".
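
The two corrections mentioned above may be sketched in Python. The Spearman-Brown formula and the standard correction for attenuation are used; that Harper lengthened the 45-minute objective test to the 150 minutes of the essay test is an assumption, made because it reproduces the reported figure of .91. The function names are hypothetical.

```python
import math


def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Reliability of a test lengthened by the factor `length_ratio`."""
    return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Validity coefficient corrected for unreliability of both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Assumed reading of the text: a 45-minute objective test with reliability .75
# is projected to the 150-minute length of the essay examination.
length_ratio = 150 / 45
lengthened = spearman_brown(0.75, length_ratio)
print(round(lengthened, 2))  # prints 0.91
```

The observed (uncorrected) validity coefficient is not reported in the text, so `correct_for_attenuation` is shown without a worked figure.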

The above studies explored validities of marks for predicting
achievement in other examinations. However, a more important function
of examination marks is to predict performance in actual life. [...]
(1964 b) explored this aspect of examinations. He investigated the
validity of Teacher Training Courses for predicting performance in
teaching on a sample of 118 male teachers of Boys' Intermediate
Colleges of Allahabad City. His study suggested that the training of
teachers had not improved their teaching efficiency significantly.

It may be observed that an examination is supposed to measure 
the development of various abilities and skills which education aims 
to bring about in students. How far the present examination system 
measures the development of such abilities is an area almost completely 
unexplored in this country. 


5. DISCUSSION 

Before concluding this part of the review it seems desirable to 
make a note on the practical utility of “validity” and “reliability”. 

Validity indicates the degree to which a test is capable of achieving
certain objectives. There are three types of validity generally
recognised, i.e. content, criterion-related and construct. As stated
earlier, Indian studies are mostly limited to predictive validity.

Validity is the most important quality of a test. However, validation
of a test requires a generally accepted criterion, which seldom exists.
So in practice a test is considered to be valid if the questions are relevant
to the curriculum and the test is reliable.


As already defined, reliability refers to the consistency of
measurement. The consistency of measurement may be considered from two
aspects: relative consistency and absolute consistency. (For details
see Thorndike, 1949.)

Relative consistency: The coefficient of reliability, which is
the correlation between two sets of scores, provides an index of relative
consistency. This, being a correlational concept, suffers from the inherent
weaknesses of the correlation coefficient. The correlation coefficient
assumes that both the measures are normally distributed, an assumption
which is often not satisfied in the case of many educational tests.
The farther the deviation is from normality, the more distorted is the
value of r. Besides, the coefficient of reliability is influenced by various
factors other than the intrinsic agreement between the two measures,
e.g. the length of the test, consistency of the test items, the method
by which the reliability was determined, testing conditions, and
variability in the ability level of the group on which the test was
administered. Thus one has always to be on one's guard before drawing
any inference from the coefficient of reliability.

Absolute consistency: The standard error of measurement gives
an index of absolute consistency. It is relatively independent of the
[...] group [...] given. (See also Harper, [...] and Misra [...] b.)
Guilford ([...]) holds that the standard error of measurement is
a better way of expressing the reliability of a test than the
reliability coefficient, as it takes into account the variability within the
group as well as the self-correlation of the test.

However, both the reliability coefficient and the standard error
of measurement have limited value for unscaled marks. For unscaled
marks it is quite possible for the reliability coefficient to approach
[...] and the standard error of measurement to approach 0, and yet on
the one hand every student may pass and on the other every one may fail.

[...] efficiency of an essay test [...] that when two essay tests of the
same length are given to the same group, the efficiency of the test
should [...] on the basis of reliability coefficients. However, [...] of
different lengths [...] consider the standard [...]
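
The relation between the two indices of consistency may be sketched in Python using the usual formula, standard error of measurement = SD x sqrt(1 - r). The figures are hypothetical and the function name is an assumption made for illustration.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard error of measurement from the SD of marks and the reliability."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability is expected to lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Hypothetical paper: marks with an SD of 10 and a reliability of .64.
sem = standard_error_of_measurement(10.0, 0.64)
print(round(sem, 1))
```

The same reliability coefficient thus implies a larger error in marks for a paper whose marks are more widely spread, which is why the two indices are not interchangeable.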

1. IMPROVEMENT IN CURRICULUM

Many experts ([...] 1950; Secondary [...]; Standards Committee [...];
Hill 1967 [...]; Kaul [...]) [...] are not well co-ordinated in India
[...] be improved. [...] ([...] et al 1961; Secondary Education
Committee, Assam [...]; [...] 1967 c, 1967 e, 1967 f) [...] curricula
should be so improved as to cater to the needs of various [...] of
students: (a) those who would enter life after passing [...]
examination, (b) those who would enter technical [...], and (c) those
who would enter higher education. [...] it appears that [...]
overlooked [...] English is a second language for a [...] of students
[...] framing objectives [...] of English. [...] (1966 c, 1967 g,
1967 f) recommends [...] should measure ability to [...] interpret
and apply knowledge to new situations rather than the ability to
reproduce memorised facts. This seems to suggest that the curriculum
should be clear on what abilities are to be developed by teaching a
particular course content.


2. IMPROVEMENT IN QUESTIONS


(a) Specificity in questions

Cieslak et al (1959) recommended that the essay question must
carefully define the problem so that students need not guess what the
examiner wants; the basis of grading should be clear to the examiners
and the students.

Ray (1970) recommends that questions should be improved so
that they measure the educational objectives set by the curriculum.
To be specific, a question should have no room for different
interpretations by paper-setter, examinees and examiners as to what
should constitute the answer to the question.

Specificity in questions has been explored by some studies in India.
Gayen et al (1961), Lele et al (1962 a, 1962 b), Hill (1967 f), and
Misra (1968 b) found that some of the questions asked in our
examinations are vague. Such questions encourage guessing on the part of
the students and lower the reliability of the test. Hill (1967 f) states
that in the traditional question papers there is usually no clue, except
the number of marks allotted to the question, to indicate the length and
scope of the expected answer. Such questions give no stimulus to
precision in learning.

Lele et al (1962 a, 1962 b) found that there are questions which
are considered to be simple by students and difficult by teachers. They
concluded that it is probably this differential difficulty of questions as
seen by teachers and students that makes essay examinations less reliable.

Some experts have studied the problem of the language of question
papers. Gayen et al (1961) state that "the present system of teaching
a subject in one language, but setting of questions in another creates
confusion in the candidates, so the media of instruction, of paper setting
and of examination (answering) must be one and the same". The
Secondary Education Committee, Assam (1965) issued a questionnaire
[...] opinions on this point, i.e. whether there was any
substance in the complaint that the practice of setting questions in
English when the medium of instruction was the regional language was
responsible for the large percentage of failure in the examinations.
Some of the [...] affirmative. Others did not agree with this view.
[...] Headmaster answered: "A Matriculate or a High School [...]
to have at least this much knowledge of English as to understand
the meaning of a question set in English."

[...] be observed that however important the ability to
comprehend the English language [...] English [...]

(b) Increase in the number of questions

The problem of having [...] question paper has been studied [...]
([...] 1960, 1961, 1963 b, 1963 d; Hill 1965, 1967 f, 1967 g, 1967 h;
Misra 1968 b). They hold that because of limited [...] the traditional
type question paper, a high degree of [...] the evaluation of
achievement. They recommend [...] of more compulsory questions in the
question paper [...] coverage of course content.

Harper (1960), Hill (1967 f, 1967 h), Misra (1968 b) and Trivedi
(1970) recommend the use of short answer type questions in the question
paper, as they feel that there is no evidence that the essential
characteristics of [...] measured by questions requiring answers in one
or [...] sentences, or a page at the most.

(c) Reduction in the number of optional questions

One problem [...] of teachers in India and abroad is whether
options should be allowed in examinations or not. Ebel (1965)
recommends that giving choice among questions should [...] necessary
[...] such optionals [...] India [...]

Sharma (1963) suggests the provision of [...] in such a way that the
alternatives [...] for the students. [...] advocates abolition of
optional questions on the following grounds: that the candidate who is
expected to answer questions is also [...] to select them; that the
strain of the process of selection results [...]; that lower down [...]
examination a student may [...]


86 RESEARCH ON EXAMINATIONS IN INDIA

[...] and Training to show how to write good objective-type questions.
Despite these recommendations, so far there has [...] towards the
introduction of objective-type tests [...] examinations. At the
University level they have not been given even a fair trial.


3. IMPROVEMENT IN THE ACCURACY OF MARKING

Taylor (1963 a), George (1964) and Harper et al ([...]) found
that examiners have a tendency to boost up borderline cases into a
higher division, which Taylor (1963 a) calls the J-effect. They [...]
the J-effect introduces further errors in marking and, therefore, [...]
the need of impressing upon the examiners not to boost up [...] marks.

Taylor, Tluanga and Misra explored the factors that lower the
reliability of examiners. Taylor and Tluanga (1965) found that there is
a positive persistence effect, that is, the following mark is almost always
attracted towards the preceding marks in marking. The occurrence of
negative persistence, that is, the following mark being repelled by the
preceding marks, is not disproved, but if it occurs it must be very [...]
Taylor, Tluanga and Misra (1966 a) found that almost half of examiners
are subject to large variations of standards in the course of their
marking. The study showed the existence of a diurnal fluctuation in
standards of marking. Taylor, Tluanga and Misra (1966 b) found that
examiners' markings are much more accurate early in the process, after
some twenty-five scripts have been marked. It becomes most inaccurate
about two-thirds of the way through. Strict examiners were found to be
more accurate in marking. Tendency to mark systematically lower or
higher was found to be a constant characteristic of an examiner. The
tendency to mark on a compressed or wider scale was found to be another
constant characteristic. To take care of the above factors in marking,
Misra (1968 b) recommended the provision of model answers to examiners,
elaborate instructions for marking, and use of question-wise marking
rather than script-wise marking. Harper (1963 c) showed statistically
that question-wise marking will reduce errors in marks. Gayen et al
(1961) recommended the provision of model answers and elaborate
instructions for marking, and giving more time to examiners or fewer
scripts to them. The S.S.C. Board, Maharashtra (1963 [?])
recommended training of examiners, and increase in time allowed for
marking and in the number of scripts re-valued by the Head Examiners.

In certain universities two examiners are appointed to mark the
same scripts independently. When the marks given by the two examiners
differ widely, a third examiner is appointed, and the average of the
two marks which are closest is given to the student. Taylor and Tluanga
(1964), in a statistical study, found that the [...] which are closest
may be more erroneous than the average [...] original marks that differ
[...] the accuracy of marking [...] instead of two is only nominal. They
recommend [...] a script is referred to a [...] should be given to the
[...]

[...] (1970) recommends use of grading in place of marking.

4. SCALING OF MARKS

Some Indian studies have explored [...] marking in different papers
of an examination [...] suggested methods of scaling to make
achievements in different papers comparable. Mahalanobis and
Chakravarty (1934) found [...] the distribution of marks in various
papers of the High School Leaving Certificate examination of 1919 of
the U.P. Board [...] the use of scaling [...] different papers [...] the
same mean and dispersion [...] Harper [...] non-linear scaling [...]
method of scaling [...] pass percentage [...] (1965 b) showed [...]
scaling marks to the same mean [...]

In an empirical study, Taylor, Tluanga and Misra (1966 b) [...]
Taylor's (1963 a) method of scaling reduces error of marking. It may be
observed that despite so many recommendations put forth by [...]
experts on examination for scaling marks, no public examination body
in India except the Gauhati University employs any kind of scaling in
its examinations.
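
Scaling the marks of different papers to the same mean and dispersion may be sketched in Python as below. This is a plain linear (standard-score) scaling under assumed target values; it is not claimed to be Taylor's (1963 a) method or that of any other author cited above, and the marks shown are hypothetical.

```python
from statistics import fmean, pstdev


def scale_marks(marks, target_mean, target_sd):
    """Linearly rescale marks so that they have the target mean and SD."""
    mean = fmean(marks)
    sd = pstdev(marks)
    if sd == 0:
        raise ValueError("marks with zero dispersion cannot be scaled")
    return [target_mean + target_sd * (m - mean) / sd for m in marks]


# Hypothetical raw marks in two papers marked on different standards.
history = [22, 30, 35, 41, 47]
mathematics = [10, 45, 60, 82, 98]

# Both papers are brought to an assumed common mean of 50 and SD of 10.
scaled_history = scale_marks(history, 50, 10)
scaled_mathematics = scale_marks(mathematics, 50, 10)
print([round(m, 1) for m in scaled_history])
print([round(m, 1) for m in scaled_mathematics])
```

After scaling, a mark in one paper carries the same relative standing as the same mark in the other, although the order of candidates within each paper is unchanged.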


5. GRACE MARKS

In almost every Indian examination some grace marks are
given to borderline cases so as to enable them to pass the examination.
The award of grace marks is often arbitrary. On statistical analysis,
Taylor and Tluanga (1963 a) found that such a procedure is unscientific.
They recommended the use of "passing probability" instead of awarding
grace marks. Those who are above a specific passing probability should
be allowed to pass. Assuming the standard error of marking to be 5% and
7%, they prepared charts giving calculated values of passing probability
for a given raw score. Bora (1963) observed that the use of the passing
probability charts should only be made after the standard error of
measurement in each paper has been determined by further research.
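
The idea of a passing probability may be sketched in Python under a simple assumption: the candidate's true mark is taken to be normally distributed around the raw mark, with the standard error of marking as its standard deviation. This model, the pass mark of 33 and the function name are assumptions for illustration; the text does not give the method by which Taylor and Tluanga computed their charts.

```python
from statistics import NormalDist


def passing_probability(raw_mark: float, pass_mark: float, standard_error: float) -> float:
    """Probability that the true mark is at or above the pass mark.

    Assumes the true mark is normally distributed around the raw mark
    with standard deviation equal to the standard error of marking.
    """
    if standard_error <= 0:
        raise ValueError("standard error must be positive")
    return 1.0 - NormalDist(mu=raw_mark, sigma=standard_error).cdf(pass_mark)


# Hypothetical pass mark of 33 and the two standard errors quoted in the text.
for se in (5.0, 7.0):
    for raw in (28, 31, 33, 35):
        print(se, raw, round(passing_probability(raw, 33, se), 2))
```

Under this model a candidate exactly on the pass mark has a passing probability of one half, and a larger standard error draws the probabilities of candidates just below the line closer to one half.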

6. CHANGE IN THE BASIS OF AWARDING "PASSES"

Gayen et al (1961) hold that as the difficulty level of a question
cannot be determined before students' performance on the question
is known, the maximum marks allotted to each question should be given
only as tentative in the question paper. The final fixation of maximum
marks should be determined after the tabulation of students'
performance, in a meeting of paper-setters and examiners. If this procedure
is adopted, the chance factor in achievement due to selection of items
will be considerably reduced. They also recommend that the minimum
marks for getting a division should also be determined each year on the
basis of students' actual performance that year.

Hill (1964 b and 1967 c) recommended that for the students who
would enter life after passing the secondary school examination, pass
should be the rule and failure an exception. He wanted that we should
re-work our standard of pass, and that a pass in the secondary
examination should not mean a ticket for admission to university.

7. ABOLITION OF SUPPLEMENTARY EXAMINATIONS

Taylor (1962 [?]), in a statistical study, found that the supplementary
[...] prediction of further performance of students. Gayen et al (1962)
found that the total of test examination and external examination marks
is more reliable than the external examination marks taken separately.

Hill (1965, 1966 a and 1967 h) recommends making internal assessment
systematic, and suggests that a student's performance in internal
assessment should also be reported along with his achievement in the
external examination.


9. SEMESTER SYSTEM 

University Grants Commission (Ed.) (1969); Elehance (1970);
Golay (1970); Desai (1970); Godbole (1970); Shapeti (1970) recommend
the introduction of the semester system.


10. VIVA VOCE

Ray (1970) recommends supplementing the essay test with a viva voce test.


IV Conclusions Suggested by Indian Studies 
As is evident from the foregoing discussion, various problems of 
examinations have been covered in Indian studies. However, there is 
a pressing need for further research. To quote Dave (1966, p. 36): 

Although several major dimensions of the field of educational
measurement and testing have been tackled, these dimensions
have not yet been exhaustively studied. Much more research work
is necessary in practically every aspect of the field in order to create
sufficient knowledge, develop new techniques and produce effective
material which may ultimately result in the qualitative improvement
of educational testing in India.

None the less, the following broad conclusions may be drawn from 
Indian studies: 

1. There is high wastage and stagnation in Indian education. 

2. The standards, in terms of passes, of supposedly equivalent 
examinations differ considerably from one examining body to another 
in India. 


3. The standard of marking varies for different papers of an
examination, for different examiners of the same paper, and for the
same examiner on different occasions.

4. None of the reliability figures reported in Indian studies reach
[...] acceptable for grading individuals. Thus, examinations are not
fulfilling their avowed purpose to "assess merit" accurately.

[...] validity of essay-type tests is very low for predicting job [...]
[...] (1962, p. 34) observed that there are other aims, the attainment
of which are better evaluated by multiple choice questions, open book
examinations, short answer examinations, [...] tests, etc.

The Standards Committee of the University Grants Commission
(1965, p. 29) observed: The use of objective tests will not only
make internal assessment more reliable but will also ensure [...]
economy and efficiency.

The Education Commission (1966, p. 291) recommended the
introduction of objectivity in our examinations. They observed: Another
important point of emphasis would be the reorientation of university
teachers to adopt new and improved techniques of evaluation.

It is apparent from these recommendations that there is a great
awareness in India of the need for examination reform. Only the desire
for change has got to be felt thoroughly enough at the lower levels so
that the change becomes possible.


Agencies Active in Examination Reform

As a result of the foregoing, many agencies and educational bodies
in India have become active in the field of examination reform. Among
the most important of these are the Department of Curriculum
and Evaluation, NCERT; National Institute of Education (NIE);
Evaluation Units of Secondary Boards; Gujarat Research Society;
Bureau of Educational Research, Ewing Christian College, Allahabad;
Bureau of Psychology, U.P.; the Kerala, Baroda, Mysore and Gauhati
Universities; the Indian Statistical Institute (ISI), Calcutta; and the
Regional Colleges of Education. Of agencies working on public
examinations, the Gauhati University among the universities and the
Central Examination Unit are in the vanguard (Dave, 1966, p. 10).

Besides, several important studies were made under the grants-in-aid
schemes of the National Council of Educational Research and
Training and the University Grants Commission. Among the studies
done under the grants-in-aid schemes, those done by Gayen of Kharagpur,
Harper of Allahabad, and Pillai of Kerala are outstanding.

A brief report on the progress of examination reform in India is
contained in DEPSE (196[?] b) and (1967 d).

Publications Pertinent to Examination Reform

Publications in the field of educational research have appeared
in great number in recent years, especially from the work of the
above-mentioned groups. A few very important ones dealing specifically
with problems of examination reform are mentioned below.

[...]

Examination Reforms in Maharaja Sayajirao University of Baroda,
published by the M. S. University, Baroda.

Considerable research on examinations has been done, yet much
more is needed to provide essential knowledge about educational
measurement. Our knowledge of the human mind is far from satisfactory.
What knowledge, abilities and skills are of most worth and how to
measure them best are questions which have not been answered
conclusively so far.

As regards examination research in India, it is still in its infancy.
Probably the Examination Research Unit of the Gauhati University
is now the only agency solely devoted to research on examinations.
There is no periodical exclusively devoted to educational measurement.
It is hoped that with the advance of time examination research will
gather momentum in this country.

Besides, examination research in India suffers from various
handicaps. Many of our universities do not train their students in research
methodology. The result is that sometimes research studies have poor
experimental design, sometimes assumptions not warranted by the
data are made, sometimes some essential information is not reported,
which makes it difficult to draw an independent conclusion, and
sometimes wrong inferences are drawn. We do not yet have even an agreed
pattern of research reporting. Much research work done has not been
published. A huge number of dissertations submitted for Master's or
Doctor's degrees in Psychology and Education are on educational
measurement. A great majority of them are on test standardisation, most
of which have no further utility than the award of a degree.

There are certain problems which can be handled by some central
agency. Standardisation and development of national norms of
intelligence and achievement tests is overdue now. The determination of
objectives of university education in observable behaviour, construction
of admission tests to various courses, and exploration of validities of
various courses are the areas where a central agency could be of immense
help. It is unfortunate that Indian universities take approximately
three months' time for the publication of results. A central agency
could help universities to speed up the tabulation of results by modern
methods.

Despite the need for more research and improvement in the
standards of research, it may be stated that enough research on
examinations has been done to lead us in a definite direction. Poor
reliability and validity of essay-type tests is now established beyond any
reasonable doubt. There is a need to try methods supposed to be better
ones, at least on an experimental basis. When the results seem
encouraging we should adopt these methods in our public examinations.
Research by itself cannot reform examinations. Research findings
have to be disseminated to those who have the power to reform
examinations. Researchers use a technical language, and rightly so because
they address their co-workers, but such writings make little sense to
people at large. We need researchers who may write reports in laymen's
language. We need other talented people who, though not actually engaged
in research, may interpret the research findings to the general public
through popular articles in newspapers and magazines.

Most of our teachers do not know what objective-based teaching,
learning and evaluation means. We need to train them in the modern
concept of evaluation. The need of improved evaluation is there at
every level of education, but it is dire at the university level because it
is the university students who are the future leaders of our country,
and to the extent that we are able to measure their competence, to the
same extent we shall be able to get competent people to lead and govern
us.
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T HE GENERAL purposes of the two experiments have already 
been described in Chapters III and IV It is hoped that the reader 
has already read those, to get an overview of the projects It is 
now time to give the more technical details 

In this book the descriptions of the “ninety marking ten” experiment
have preceded the “four thousand re-examined” project. This
is because the former is perhaps more striking and easy to understand.
However, it is easier to describe the data gathered for the latter study
first, as it was the more extensive. The few ways in which the data of the
“ninety marking ten” project differed will then be given.

A. Four Thousand Re-examined

1. Examinations selected

The original purpose of the study was to answer two related, though
somewhat different, questions: “How reliably can an examiner re-mark
answer books previously marked by himself?” and “How reliably can
an examiner re-mark answer books previously marked by another
examiner?” (As we shall see, new problems arose out of the data itself,
so that ultimately a good deal more than just these two questions was
answered.)

There are two possible approaches to this sort of study. The first
is to set up a special examination experiment, asking students to answer
a question paper, and having each answer book marked by two examiners.
The second is to use the results of an actual examination administered
under the usual conditions.

One big advantage of the first approach is that two completely
independent markings are possible, i.e. the first examiner can be instructed
to make no marks in the answer books. However, the type of answers
that students may write in such a situation may be different from those
written in a regular examination. More important, examiners under
such circumstances may act differently from the way they usually
do, and thus the results may not be believed to apply to the routine
situation. Nevertheless the first approach has considerable merit and,
after finishing this study, the junior author went on to carry out a study
along those lines. In the present study, however, actual examination
answer books were used.

The generous cooperation of a Board of Secondary Education
was obtained. The Board which provided the basic data has asked to
remain anonymous on the very reasonable grounds that the results of this
study are probably applicable to any Board, and thus it would be
unfortunate if any particular Board (rather than the whole examination
system itself) were stigmatized because of these findings. (Those who
may already know the identity of this Board are asked to respect this
request for anonymity.)

The Class X Final Examination was chosen as it is one of the
most widely administered and crucial general examinations in the entire
educational system.

Four subjects were chosen: History, Hindi, biology, and mathematics.
These were chosen as fairly representative of the total system. There
are two Arts exams, one of them social studies and one language. There
are two Science exams, one involving considerable judgement in marking
and the other a supposedly more exact science. For simplicity, only
one paper, generally the First Paper, was chosen to represent each subject.
In mathematics the Second Paper was selected, as this was the geometry
paper, which was considered to involve slightly less mechanical marking
than the First Paper. Each of the four was a three-hour paper. The
Hindi paper carried 34 marks, the others 50 marks each.

2. Examiners selected

The Secretary of the Board was asked to provide answer books
from eleven experienced examiners in each subject, i.e. the work of
a total of 44 examiners. The examiners in each subject were selected
from among those who had worked under the same Deputy Head
Examiner, and thus could be expected to be experienced in the same
method and standard of marking.

3. Experimental design

One thousand answer books in each subject were selected. Each
examiner re-examined 50 answer books originally examined by himself
and 50 originally examined by another examiner. The examiner was not
informed as to which were originally marked by himself, and which
originally by another. He was informed that “some of them were
originally marked by yourself, and some by another examiner”, but
not how many scripts were of each type.

The selection and redistribution of answer books was
according to the following design:

                             Scripts originally marked by
Re-examined by      A    B    C    D    E    F    G    H    I    J    X  Total

A                  50   50                                                 100
B                  50   50                                                 100
C                            50   50                                       100
D                            50   50                                       100
E                                      50                           50    100
F                                           50                      50    100
G                                                50                 50    100
H                                                     50            50    100
I                                                          50       50    100
J                                                               50   50    100

Number originally
marked by each    100  100  100  100   50   50   50   50   50   50  300   1000
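The design above can be restated as a small allocation table whose row and column totals are easy to check. The following is only an illustrative sketch for one subject; the function and variable names are our own and are not part of the original study.

```python
from collections import Counter


def build_design(batch=50):
    """Return {re-examiner: {original examiner: number of scripts}} for one subject."""
    design = {}
    # Examiners A-B and C-D are paired: each re-marks 50 of his own
    # scripts and 50 of his partner's.
    for first, second in [("A", "B"), ("C", "D")]:
        design[first] = {first: batch, second: batch}
        design[second] = {second: batch, first: batch}
    # The remaining six re-mark 50 of their own scripts and 50 of Examiner X's.
    for examiner in "EFGHIJ":
        design[examiner] = {examiner: batch, "X": batch}
    return design


design = build_design()

# Row totals: scripts re-examined by each examiner.
re_examined = {examiner: sum(row.values()) for examiner, row in design.items()}

# Column totals: scripts originally marked by each examiner (including X).
originally_marked = Counter()
for row in design.values():
    originally_marked.update(row)
```

Both sets of totals should agree with the margins of the design: 100 scripts re-examined by each of the ten examiners, and 1000 scripts in all.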


Thus each examiner re-marked 50 answer books originally marked
by himself and 50 originally marked by another examiner. The first
two examiners (A and B) were paired, each re-examining 50 of his own
scripts and 50 of the other's. The second two examiners (C and D)
were likewise paired. The remaining six examiners re-examined their
own scripts and those of the eleventh original examiner, Examiner X.
Examiner X was specially chosen as one who had had considerable
experience in marking. (It was originally specified that Examiner X
should also be one “whose marking has conformed closely to the normal
pattern.” Without previous knowledge of the means and standard
deviations of the marks this specification was impossible to apply. As it
turned out, three examiners did conform fairly well to this requirement,
but the mathematics Examiner X's mean was considerably above average
for his subject.)

4. Selection of answer books

The Board arranged to have answer books drawn from their stacks,
before the 1964 examination booklets were destroyed. From among
these, the Research Assistants selected the requisite number to fulfil
our experimental design.

The plan stated that “the scripts of each examiner will be so selected
that they include a reasonable range of his marks.” At the Board's
offices the answer books are kept in large cloth bags in huge stacks
in the store rooms. Though specific books could be found with much
searching, apparently, they could hardly be considered to be in “apple-pie
order”. Thus a strictly random sampling would have required an
enormous amount of searching and other work, an amount which did
not seem justified under the present circumstances. Were this a study of
the statistics of examinations (as, for example, were Gayen's, 1961 and
1962), then a strictly random sample would be needed. But since this
is a study of re-examining, it is only necessary to be reasonably certain
that the selection of answer books was not done in such a way as to bias
the results of the re-marking experiment.

In general, the Research Assistants selected the first answer books
that they could find, of each of the 44 examiners. Since the answer books
were not stored in any particular order, there is no reason to believe
that the “first found” are not essentially a random sample.

Is it possible to test the representativeness of this sample?
Unfortunately, the data required for such a test are just not available. The
Board compiles no statistics except “pass percentage”. But even this
is of very limited use to us, because (a) it is based on all candidates,
while we limited our study to regular (disregarding private) candidates,
(b) it is based on the total subject, and no separate “pass percentages”
are available for the separate papers, and (c) something like 7% of the
Board's candidates pass by “grace marks”, which are not included in
our study.

With all these reservations, let us examine such pass percentage
data as we have available:

                                      History   Hindi   Biology   Mathematics

Board (all candidates)                 75.0%    77.0%    75.4%      64.6%
Our study (regular candidates only)    74.5%    70.9%    88.2%      60.7%
Difference                             -0.3%    -0.1%   +12.8%      -3.9%


As we have noted, the Board's pass percentage is based on the sum
of two (and in Hindi three) papers, while ours is based on only one of
the papers in each subject. If the marks of two correlated papers are
summed, and the pass percentage is the same for each of the papers,
the pass percentage on the sum will be higher than the pass percentage
for the individual papers. This follows from the formula for the standard
deviation of a sum:

$$\sigma_{x+y} = \sqrt{\sigma_x^2 + \sigma_y^2 + 2 r_{xy} \sigma_x \sigma_y}$$

If $r_{xy} = 1.00$, then the standard deviation of the sum is twice that
of each individual paper. But if $r_{xy}$ is less than 1.00, then the standard
deviation is smaller than twice the original standard deviations. Since
the standard deviation has thus shrunk, but the mean of the sum is
always twice the original means, the standard score distance of the
pass mark from the mean is greater for the sum than for the original
two papers and, therefore (the pass mark lying below the mean), the fail
percentage is lower and the pass percentage higher.
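The argument can be illustrated numerically. The sketch below assumes, purely for illustration, that the marks on each paper are normally distributed with the same mean and standard deviation; the function names and the figures used with them are hypothetical and are not taken from the Board's data.

```python
from math import sqrt
from statistics import NormalDist


def pass_rate_single(mean, sd, pass_mark):
    """Proportion passing one paper, assuming normally distributed marks."""
    return 1.0 - NormalDist(mean, sd).cdf(pass_mark)


def pass_rate_sum(mean, sd, pass_mark, r):
    """Proportion passing on the sum of two papers with equal means and SDs.

    r is the correlation between the two papers. The mean and the pass mark
    both double, while the SD of the sum follows the formula for the
    standard deviation of a sum.
    """
    sd_sum = sqrt(sd ** 2 + sd ** 2 + 2 * r * sd * sd)
    return 1.0 - NormalDist(2 * mean, sd_sum).cdf(2 * pass_mark)


# Hypothetical paper: mean 20, SD 8, pass mark 16.5 out of 50.
single = pass_rate_single(20, 8, 16.5)
summed_perfect = pass_rate_sum(20, 8, 16.5, r=1.0)   # same as a single paper
summed_typical = pass_rate_sum(20, 8, 16.5, r=0.6)   # higher pass rate
```

With a correlation of 1.00 the two rates coincide; with any lower correlation the pass rate on the sum is the higher one, provided the pass mark lies below the mean.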

Thus our pass percentages should be lower than those of the Board,
which they are in three of the subjects. The fact that the difference is
not greater probably reflects the inclusion of private candidates in the
Board's statistics. As for biology, this may mean that our sample is
rather better than the average. Or it may mean that Paper 1 was much
easier than Paper 2. It is equally possible, however, that our results were
biased by the individual characteristics of the particular Deputy Head
Examiner under whom our examiners had worked. This hypothesis is
given some slight credence by the fact that the mean marks dropped
more in the second marking in biology than in any other subject, and
that the mean marks of the “specially experienced Examiner X” were
lower than the means of the ten original examiners. Again, however,
the Board did not provide us with any statistics as to how those who
worked under one Deputy Head Examiner differ from those who worked
under another, so no more exact comparison can be made.

What evidence we have supports the conclusion that there is no
reason to believe that our sample is not adequately representative of the
total population from which it was drawn, for the purposes of a study
of this type.

5. Preparation of answer books

It was at this point that we ran into the first of several unforeseen
and very difficult problems. This one was so serious that its solution added
more than two months to the original time schedule of the project.
The problem was the “disguising” of both the identities of the candidate
and the original examiner, and the marks of the original examiner.

It was fairly simple to design and print a new cover, so the old cover
could be removed from each booklet before sending it out. This removed
the primary source of identification of examiner, candidate, and marks.

... when booklets were arranged in strictly numerical order for ...
this arrangement effectively mixed up the “same” and “different” examiner
booklets in an essentially random pattern.

The next step was to stamp the proper code number on both the
old and the new cover of each booklet. It was also stamped on the first
page of the booklet, as a protection against possible detachment of the
cover. The old cover was then removed and the fresh blank cover
stapled on. New covers were also added to any “B” booklets. The
“disguising” of the answer books was complete.


7. Despatch and receipt

In the meantime, the Secretary of the Board had provided us with
addresses of the examiners, and also question papers, rules and
instructions to examiners, supplementary instructions of Head Examiners, etc.
We had written to the 40 examiners, enclosing reply post cards, and
had received assurance that they would cooperate with the study. If
any examiner did not reply within a week, a reminder was sent. All
this was done before much work was done on the answer books.
Arrangements had been made to obtain other batches of answer books in case
any examiner refused. Fortunately, this was not necessary.

As soon as it was ready, each batch of answer books was despatched
by railway parcel. (We had considered parcel post, but found it
substantially more expensive.) By registered post we sent the RR Bill,
question paper, the Board's instructions (which we reprinted, with the
Head Examiner's comments added), and a bill form for remuneration
and expenses. Examiners were to be paid at the regular Board rates.
We also enclosed a money order form, asking the examiner to fill in his
correct name and address, thus saving us not only extra work, but also
possible error.

Any examiner not returning the answer books in time received a
reminder, but no deduction was made in remuneration.

As soon as the marked answer books were returned, the data were
prepared for card punching. Each cover was removed, and matched
up with the old cover. The new cover was stapled onto the old cover
in such a way that both sets of marks were visible, side by side, but the
original roll number and other information about the candidate was
completely covered. This, again, was to preserve anonymity. At the same
time the number of pages answered in each booklet was also counted,
and recorded on the cover.

8. Card punching

Arrangements had been made to punch the data on I.B.M. cards at ...

... month to complete the computer work, which we had expected to be
done in only a few days.

Furthermore, this particular computer has a rather limited memory.
The complete analysis of the results of each individual set was
within this capacity. But the summary charts which form the bulk of
this report all had to be tabulated by hand. This again added enormously
to both the time and effort required, far beyond our original estimate.

Most of this has had to be done by the senior author himself, without
help, since his return from the U.S.A.

However, it is also true that the project might have taken years,
rather than months, without the computer. The 240 coefficients of
correlation alone would have taken several months, not to speak of
means, standard deviations, scatter charts, and various frequency
distributions prepared for each of the 80 sets of 50 answer books.

Each set of calculations was done by hand for one or two sets and
checked against computer results, to check on any systematic machine
errors due to programming error. Further, the results were scanned,
and any very unusual results double-checked. (Some punching errors
were found, despite the fact that the cards were supposed to have been
verified.) The computer was programmed to calculate up to four decimal
places, rather than just dropping digits beyond the fourth place.

Appendix E lists the information printed out for each set of 50
answer books, not all of which is reported or even used in this report.
Note, especially, that sums and sums of squares are available for analysis
of variance which (for lack of time) has not been done.


10. Formulas used

Standard formulas for mean, standard deviation, and product-moment
coefficient of correlation were used. N, and not N-1, was used
in the denominator of the σ formula. In calculating t for the difference
between means, the formula for correlated means was used, and the
correlation between the two examiners was used as an estimate of the
correlation between the means.
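These formulas can be written out as a short sketch. The report does not print the formulas themselves, so the version below is an assumption: it uses the textbook form of t for correlated means, with the standard error of each mean taken as σ/√(N-1) because σ is computed with N in the denominator. The function names are our own.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def sd_n(xs):
    """Standard deviation with N (not N-1) in the denominator."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (sd_n(xs) * sd_n(ys))


def t_correlated_means(xs, ys):
    """t for the difference between two correlated means (same scripts, two markings)."""
    n = len(xs)
    se_x = sd_n(xs) / sqrt(n - 1)   # standard error of each mean
    se_y = sd_n(ys) / sqrt(n - 1)
    r = pearson_r(xs, ys)
    se_diff = sqrt(se_x ** 2 + se_y ** 2 - 2 * r * se_x * se_y)
    return (mean(xs) - mean(ys)) / se_diff
```

Computed this way, t is algebraically the same as the familiar paired t-test on the differences between the first and second markings.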

It was discovered (after all the calculations were finished) that
the t-test for difference between correlated standard deviations does
not apply when N's are less than 100. The F-test, of course, does not take
account of correlation. Dr. C. R. Rao, Professor and Head of the
Research and Training School, Indian Statistical Institute, Calcutta,
kindly provided us with the following significance test for this
situation:

$$r = \frac{S_{11} - S_{22}}{\left[(S_{11} + S_{22} + 2S_{12})(S_{11} + S_{22} - 2S_{12})\right]^{1/2}}$$
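A widely used test for this situation, commonly known as the Pitman-Morgan test, correlates the sums of the paired marks with their differences; a correlation of zero corresponds to equal variances. The sketch below assumes that this is the kind of test intended here, and that the S terms denote sums of squares and products about the means; both points are our assumptions, as are the function and variable names.

```python
from math import sqrt


def correlated_variance_test(xs, ys):
    """Test of equal variances for two correlated sets of marks.

    Returns (r, t): r is the correlation between (x + y) and (x - y),
    and t is the corresponding t statistic with n - 2 degrees of freedom.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mx) ** 2 for x in xs)                      # sum of squares of x
    s_yy = sum((y - my) ** 2 for y in ys)                      # sum of squares of y
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))    # sum of products
    r = (s_xx - s_yy) / sqrt((s_xx + s_yy + 2 * s_xy) * (s_xx + s_yy - 2 * s_xy))
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return r, t
```

A positive value indicates that the first set of marks is the more variable, a negative value that the second is.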
1. Choice of method of reproduction

There were several ways in which copies of the answer books could
have been made. The simplest would have been to have had them typed
and mimeographed or printed. Taylor et al. (1966) used this method,
taking every care ... were duplicated, and words written and cancelled
by the candidates ... difficult where there are drawings or
mathematical symbols to be reproduced. More important, the factor of
handwriting is lost altogether; and the fact that typing or printing is so
much easier to read may influence the marks. A second method is to have
someone copy the answer book by hand, onto mimeograph stencils or spirit
offset masters. This, also, is relatively inexpensive, and may produce
multiple copies closer to the original. But unless one has several
copyists, in fact a different one for each answer book, the factor of
handwriting differences is again lost. And, in any case, copying will
inevitably be neater than the original script, which is written in haste and
under the pressure and anxiety of examination conditions.

The only answer to all these objections is to reproduce the original
answer book photographically. At the time that this study was initiated,
this was easier said than done. Photo-offset was a new process in India,
therefore hard to obtain, and prohibitively expensive. Multigraph
(both Multilith and Rotaprint) were available, but again the photographic
process was far beyond the budget of this research. Thermofax was
available for inexpensive production of masters, but this required scripts
written in pencil or with a carbon-content ink, so it also was ruled out.
For a while, in fact, it looked as though it might actually be cheaper
to have the material reproduced abroad. But after several weeks of
investigation, we found that Xerography was being introduced into India.
The Xerox process is among the least expensive methods of producing
facsimile multigraph or offset masters. Messrs. Gaumont Kalee Ltd.
of Bombay were importing a Xerox machine capable of producing
Rotaprint masters, and they already had a Rotaprint on which to make
the prints. They agreed to undertake our work, as soon as the Xerox
arrived.


2. Choice of examinations

The original intention was to have ten answer books re-marked
by one hundred experienced examiners, in each of the four subjects
used in the “four thousand re-examined” project. But both the time
... and the budget of ... method ... the study of the first
examination selected for reproduction, i.e. history.

... could be eliminated by using a red filter in ... future
investigators ... this note ... can be used ... offset process.)
Thus we were left with no ... in the ... sent to the participants in
that study indicated that the disguising was reasonably successful, ...
assume the same in this study.

Of course, we probably were not completely successful. The
correlation between the mean marks awarded by the 90 examiners and the
number of previous markings in each answer book (including the bogus
ones we added) was +.41, while the correlation of the marks awarded
by the original examiners with these markings was +.29. Since we are
interested in these groups themselves, and not in any statistical inference,
we may consider this difference significant for these data. It is logical
to assume that there would be some relationship. The +.29 suggests
that the random addition of markings may have reduced the original
correlation, while the +.41 suggests some (unconscious?) influence
on the 90 examiners.

Still we do not really know the extent to which these previous
markings determined the results of this study. This is the price we had
to pay for having a “realistic” (rather than an artificially-produced)
set of answer books, in the candidates' own handwriting, with errors
corrected, mistakes, etc. It is obvious, however, that whatever influence
these disguised previous markings may have had would have been to
increase the degree of agreement among examiners. We may, therefore,
interpret the following results as maximum rather than average or
minimum inter-examiner agreement figures.

A code number was written on each page of each answer book,
before sending to Bombay. Using these code numbers, we collated the
sheets sent us from Bombay into test booklets. The reproduction took
more than three times as long as the company (which had had no previous
experience with this process) originally estimated, making it impossible
for us to finish the project on our original schedule.


5. Experimental design

Dr. H. J. Taylor has drawn attention to what he calls the “persistence
effect” in the marking of examination papers. We were, therefore,
interested in whether the serial order of marking would influence the
marks given individual answer books. Ideally, the papers should have
been marked in a different order by each examiner, but the administrative
complexity of ensuring this, and the clerical complexity of tabulating
results, ruled this out. Thus we decided on using two orders only, one
the reverse of the other. The ten answer books were arranged in random
order, then numbered 10-A, 9-B, etc. through 1-J. Two types of marks
sheets were prepared: in one of them, the books were listed in the order
“A, B, C, ... J”; in the other the books were listed in the order
“1-J, 2-I, 3-H, ... 10-A”. The first order was sent to the first 25
examiners on our list, the second order to the second 25, the first order
to the third 25, and the second order to the last 25 examiners. Since the
names of the examiners were presumably in no particular order, this
was assumed to randomize the procedure adequately. Examiners were
asked to mark the books in the order received.
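The allocation of the two orders to the list of 100 examiners can be sketched as follows. The names used here are illustrative only; the position of an examiner on the list is counted from zero.

```python
from string import ascii_uppercase

BOOKS = list(ascii_uppercase[:10])   # the ten answer books, A to J
FORWARD = BOOKS                      # first order: A, B, C, ... J
REVERSE = BOOKS[::-1]                # second order: J, I, H, ... A


def marking_order(examiner_index, block=25):
    """Order of marking for the examiner at this 0-based position on the list.

    Blocks of 25 examiners alternate between the two orders:
    first, second, first, second.
    """
    return FORWARD if (examiner_index // block) % 2 == 0 else REVERSE


# Number of examiners who were sent each order.
sent_forward = sum(marking_order(i) is FORWARD for i in range(100))
sent_reverse = 100 - sent_forward
```

Each order thus goes to 50 of the 100 examiners.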


6. Despatch and receipt

At our request, the Secretary of the Board of Secondary Education
had provided us with a selected list of 100 experienced examiners in
Class X History. As in the “four thousand re-examined” experiment, each
examiner was supplied with the question paper (only those questions
actually answered), the Board's and Head Examiner's regular instructions,
and the supplementary instructions of the Deputy Head Examiner. A
typical instruction for one of the questions was: “4 marks for social
condition, 3 marks for political condition and 3 marks for economic
condition.” No model answers were supplied. The instructions were the
identical instructions which had been used for the regular examination.
All the examiners had marked this paper in the regular examination.
Therefore we may presume (though we have no proof, of course) that
their “standard of marking” for these 10 answer books was approximately
the same as for the original examination.

To save both time and expense, no attempt was made to contact
examiners ahead of time. A letter explaining the project and requesting
cooperation was sent along with the answer books, by Registered Post.
Each examiner was offered an honorarium of Rs. 10/- for marking and
returning the examination books. Under these circumstances a 90%
return seems reasonably good, considering that examiners may have
moved, been ill, or otherwise disinclined to cooperate. Repeated attempts
to trace them and get back the answer books were made, of course, but
ultimately the marks sheets of the remaining ten could not be obtained.


7. Reliability of the data

However, the fact that only 90 of the 100 examiners returned their
marks sheets does raise the question: how serious is the fact that 10
did not? Our design gave us two samples of examiners: those to whom
the answer books were sent in the first order, and those to whom they
were sent in the second. Forty-two of the first group returned their
papers, and 48 of the second. It is reasonable to assume that the 48 are
representative of the 50 in the second group. If the 42 in the first group
can be shown to be similar to those in the second group, it seems
reasonable to assume that they are an unbiased sample of the original 50
in the first group, and that the 90 examiners therefore fairly represent
the original 100.

First, the significance of the difference between the means of the
marks awarded by the two sets of examiners to the same ten answer
books was examined. The difference (0.31 marks out of 50) divided by
its standard error was .70, which certainly does not suggest any real
difference.

For a more exact test of the distribution as well as the means,
chi-square was applied. The total frequency with which each mark (ranging
from 2 to 38) appeared was tabulated separately for each of the
two samples of examiners. Small frequencies were combined. The two
distributions were then compared. They yielded a chi-square of 16.58553.
For 26 degrees of freedom, the probability of this chi-square appearing
in two samples from the same population is between .90 and .95. For
chi-square to reach even the 5% level of significance, it would have had
to exceed 38.885.
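The comparison can be sketched in a few lines. The first function computes chi-square for two frequency distributions tabulated over the same mark categories; the second gives the probability of exceeding a given chi-square, using a closed form that holds only for an even number of degrees of freedom (as with the 26 here). The function names are our own, and the original frequency tables are not reproduced.

```python
from math import exp


def chi_square_two_samples(freq_a, freq_b):
    """Chi-square for a 2 x k table of mark frequencies; returns (chi2, df)."""
    total_a, total_b = sum(freq_a), sum(freq_b)
    grand = total_a + total_b
    chi2 = 0.0
    for a, b in zip(freq_a, freq_b):
        column = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(freq_a) - 1


def chi_square_upper_tail(chi2, df):
    """P(X >= chi2) for an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs a positive, even df")
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k        # (chi2/2)^k / k!
        total += term
    return exp(-half) * total


p_observed = chi_square_upper_tail(16.58553, 26)   # falls between .90 and .95
p_critical = chi_square_upper_tail(38.885, 26)     # close to .05
```

The two values agree with the figures quoted above for the observed chi-square and for the 5% level.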

Both these arguments depend, of course, on the fact that the answer 
books were distributed in a random order. Any interaction effects between 
order of examining and marks would thus tend to be averaged out. 
Clearly, these two samples are identical. They are both very near to
complete, and the missing 10% of examiners would not have affected
results significantly.

The Concept of “Reliability” as Applied
to Traditional Examinations

CONFUSION OF terminology in any science can lead to confusion
of thinking and of interpretation, and this we fear is what has
sometimes happened in discussing “reliability”. (See Review of the
Literature, Chapter V.) The APA-AERA-NCME Standards (1966) have
attempted to clarify the various meanings of reliability, primarily as
the term relates to objective tests and examinations. We have not found,
however, any clear and complete treatment of the subject as it relates to
traditional essay-type examinations, or to the comparison of the two
types. For example, when looked at from the standpoint of essay-type
examinations, the reliability of an objective test is really two different
reliabilities represented by numerically identical coefficients.

We are indebted to Warren Findley who, after reviewing an early
draft of this study, first suggested (in a personal communication) this
need for terminological clarification and who suggested some of the
concepts needed. And as the knowledgeable reader will soon find, we
have drawn heavily on the work of Harold Gulliksen (1950) and of H. J.
Taylor. However, we have tried to adapt these concepts to traditional
Indian examinations terminology also, e.g. “marks” instead of
“scores”, and “examiner” instead of “reader”.

First let us define a basic designation: “traditional examinations”.
Although we have also used the more usual term “essay-type
examinations”, we feel that “traditional” may be a more felicitous term. By this
we mean questions requiring long answers worked out in detail in a
page or more. A mathematics question requiring several pages of algebraic
transformations or calculations, but no words, is hardly an “essay”; ...

... particular examiner's scale is most simply defined in terms of two
statistics, the mean and the standard deviation of the marks he awards to a
randomly selected batch of answer books. The relationship between a
candidate's marks on the scales of any two examiners marking the same
batch of answer books is given by the formula:

$$\frac{M_x - X_x}{\sigma_x} = \frac{M_y - X_y}{\sigma_y} \tag{7.1}$$

where $X_x$ and $X_y$ are the marks awarded the same answer book by
examiners x and y, $M_x$ and $M_y$ are the means of the marks, and $\sigma_x$ and
$\sigma_y$ are the standard deviations of the marks awarded by examiners x
and y respectively. (Mahalanobis, 1934, takes into account other
statistical characteristics also, but the ones we have noted are sufficient for
most ordinary purposes.) Once the constants have been calculated for
the particular examiners (or for an examiner and an arbitrary scale),
this reduces to the form $X_y = aX_x + b$, which can be quickly calculated
for each individual candidate. But it should be noted in passing that this
formula is relevant only when marks are normally distributed. Other
scaling methods, more appropriate to marks in India, will be discussed
in a separate chapter.
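As a sketch of how formula (7.1) is applied, the function below converts a mark from examiner x's scale to examiner y's scale. The constants a and b follow from solving (7.1) for the mark on y's scale; the function name and the figures in the example are hypothetical.

```python
def rescale_mark(mark_x, mean_x, sd_x, mean_y, sd_y):
    """Express a mark awarded on examiner x's scale on examiner y's scale.

    Solving formula (7.1) for the mark on y's scale gives a linear form
    with a = sd_y / sd_x and b = mean_y - a * mean_x.
    """
    a = sd_y / sd_x
    b = mean_y - a * mean_x
    return a * mark_x + b


# Hypothetical examiners: x has mean 20 and SD 5, y has mean 25 and SD 8.
# A mark of 25 from x is one SD above his mean, so it corresponds to 33 from y.
scaled = rescale_mark(25, mean_x=20, sd_x=5, mean_y=25, sd_y=8)
```

The constants need be calculated only once for a given pair of examiners, after which each candidate's mark is converted by a single multiplication and addition.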


1. “Examiner”, “content”, and “marks” reliability

Now let us start with two parallel traditional-type examinations,
Form A and Form B. Let us assume two examiners, or readers, for each:
Examiners 1 and 2 for Form A, and Examiners 3 and 4 for Form B.
In this situation, we find three distinct uses of the term “reliability”.

Examiner reliability: This is the correlation between the marks
given by two separate examiners to the same set of answer books.
(American texts call this “reader reliability”.) Too often, as has been
pointed out in the Review of the Literature (Chapter V), researchers have
just called this “reliability”. This has then led them to the erroneous
assumption that, if this coefficient can be raised to .85 or .90, “essay
examinations are just as reliable as objective examinations”. (See
Basumallik's, 1959, review for examples.) Note that in objective
examinations, “examiner reliability” (or “reader reliability”) is 1.00.

Two forms of examiner reliability must be distinguished here. One is
when an examiner re-examines answer books
which he himself originally marked. Let us call this simply the
“self-consistency” of the examiner. The other type is when the answer books
are re-marked by a different examiner. This may be called
“inter-examiner consistency”. In various contexts in this book, these two are
distinguished by the words “Same” and “Different” (or just “S” and
“D”), and “Self” and “Other”. If the number of scripts is large and/or
the time interval long, memory is a negligible factor in examiner
self-consistency. Where the two forms of examiner reliability differ, then
such factors as different “expectations” or “styles of marking” are
probably operating.

Content reliability: This term was introduced by Gulliksen in 1936
(see Gulliksen, 1950). It might as easily have been called examination
reliability or test reliability, and is the only type of traditional-type
examination reliability which is strictly parallel in concept to the
reliability of an objective examination. (“In concept” is italicized because,
as will be seen later, it is more accurate in practice to compare “marks”
and not “content” reliability of essay-type and objective examinations.)
This is the reliability of the question paper itself, quite apart from the
reliability of the examiners. Statistically, it is the reliability of the test
corrected for the attenuation due to examiner unreliability.

Marks reliability: This is the reliability of the total examination
situation, taking into account both “examiner” and “content” sources
of unreliability. In fact we have previously called this “total reliability”
(Harper 1967, Misra 1968). But we have recently seen the term “score
reliability” used (Coffman, 1966), and this suggested that “marks
reliability” would be a more descriptive term for use in India.

Note that the same coefficient represents both the content reliability
and marks reliability of objective examinations. But for traditional
examinations, the marks reliability is the content reliability attenuated
by examiner unreliability. Thus in the practical situation, it is at the
level of marks reliability that traditional and objective test reliabilities
should be compared.
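The relation among the three coefficients can be illustrated with a short sketch. It assumes a simple model in which content errors and examiner errors are independent and additive, under which marks reliability is the product of content reliability and examiner reliability; this model, and the function names, are our own assumptions for illustration.

```python
def marks_reliability_from(content_rel, examiner_rel):
    """Marks reliability: content reliability attenuated by examiner unreliability."""
    return content_rel * examiner_rel


def content_reliability_from(marks_rel, examiner_rel):
    """Content reliability: marks reliability corrected for examiner unreliability."""
    return marks_rel / examiner_rel


# Objective examination: examiner reliability is 1.00, so the same
# coefficient serves as both content and marks reliability.
objective = marks_reliability_from(0.90, 1.00)

# Traditional examination with hypothetical coefficients.
traditional = marks_reliability_from(0.90, 0.70)
```

On these hypothetical figures, two question papers of equal content reliability yield quite different marks reliabilities once the examiner is taken into account.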

2. “Correct mark” and “true mark”

It would seem wise to restrict the use of the term “true mark” to a
meaning which is exactly parallel to the psychometric term “true score”.

The “true mark” would then be the mark which the candidate would
have received from a perfectly reliable examination, marked by a
perfectly reliable examiner. When we are talking only about examiner
reliability, and not about examination or content reliability, it would
seem wise to adopt a different term, “correct mark”. The
“correct mark” is the mark which would have been awarded if the
examiner had been perfectly reliable.*

3. Standard error “of marking”, “of content” and “of marks”

The standard error of any examination is (theoretically) the standard
deviation of the differences between “true” and obtained
scores. Thorndike (1949) calls this an “absolute measure” of reliability
as, unlike the coefficient of reliability (a “relative” measure), it is
generally unaffected by differences in group variability. For this reason
it is “more meaningful if a descriptive statement of the accuracy of
measurement is desired” (Thorndike, 1949, p. 69).

Standard error of marking. This is Taylor's term (1964, p. 27) for
the standard deviation of the differences between the mark actually
awarded by a particular examiner, and the “correct mark” which should
have been awarded if his examining were perfectly reliable. It tells
us the extent to which a particular examinee's mark may differ from the
mark which he would have received if examiner reliability were 1.00.
(Note that for objective examinations the standard error of marking
is zero.)

Standard error of content. For the objective examination, this is
the traditional standard error of measurement or SEM. For traditional-type
examinations, this would be the standard deviation of the differences
between the marks which would have been received by the examinees
if the examiners were perfectly reliable but the content was not, and those
that would have been received if both were perfectly reliable. This does
not seem to be a very useful concept.
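
The standard errors defined in this section can be illustrated with a short numerical sketch. It assumes the classical formula s·√(1 − r) for a standard error of measurement, which the text does not print at this point, and it contrasts that with the “standard error of estimate”, which has r² where the standard error of measurement has r. The function names and the numerical values are ours and are purely illustrative; they are not taken from the study.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: sd * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def standard_error_of_estimate(sd: float, correlation: float) -> float:
    """Standard error of estimate: sd * sqrt(1 - r**2)."""
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return sd * math.sqrt(1.0 - correlation ** 2)


# Hypothetical values, chosen only to keep the arithmetic easy to follow.
sd_marks = 6.5               # standard deviation of one examiner's marks
examiner_reliability = 0.84  # assumed coefficient of examiner reliability

# Using the examiner reliability gives a standard error of marking.
se_marking = standard_error_of_measurement(sd_marks, examiner_reliability)
se_estimate = standard_error_of_estimate(sd_marks, examiner_reliability)

print(f"standard error of marking : {se_marking:.2f}")
print(f"standard error of estimate: {se_estimate:.2f}")
```

Substituting a content or marks reliability coefficient for the examiner reliability would, on the same assumption, give the standard error of content or of marks.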

Standard error of marks. This is the concept which, for traditional-type
examinations, carries exactly the same meaning as the SEM does for

*Taylor et al. (1966b) have used the term “true mark” for this concept. Had they
differentiated between the three types of reliability, and between “true” and “correct”
mark, they might not have been led to be quite so enthusiastic about the “reliability”
of essay-type examinations as their paper suggests.

**Taylor et al. (1966b) have used both the terms standard error of marking (page 3)
and standard error of estimation (page 6). While the second term is acceptable, the
first seems preferable as it is unambiguous and carries instant meaning. “Standard
error of estimation” may be confused by some with “standard error of estimate”,
which is a statistically different concept, with r² rather than r in a formula which
otherwise is identical.
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questioning all sorts of people ' “ ‘^° e ^s lo the time when 
concept of content reliability does not *em to suirerf«m<his 
unscaled marks, as they are used in India, is a lower practical level of
reliability than the coefficients might seem to indicate.
This leads us to a further differentiation in our terminology. When
we are dealing with what Thorndike (1949) calls “relative measures”
of reliability (that is, the various coefficients), differences in mean and
standard deviation have no effect. However, when we deal with
“absolute” measures (the various errors of measurement), these
differences make the actual unreliability of marks much larger than it
otherwise might be. To avoid misinterpretation, let us add qualifiers
to some of the statistical terms that we have already discussed.


5. Standard errors “(scaled)” and “(unscaled)”

The standard error of marking and the standard error of measurement
that we have so far discussed apply only to the specific scale used
by a particular examiner. They will tell us something about inter-examiner
reliability only when both examiners are using the same scale.
(Or, of course, when the scale used by each examiner is converted to
some common standard.) Thus we must call this the “standard error
of marking (scaled)”. (A more accurate description might have been
“standard error of marking (equivalent scales)” but this seemed
unnecessarily cumbersome.) It should be obvious that, when examinations
are not scaled to a common standard (e.g. those in this study), there is
a separate standard error of marking (scaled) for each examiner.

Note that when we talk of the standard error of marking (scaled)
we do not necessarily imply that the examiner's marks have been scaled
to some common standard. They may or they may not have been. All
we are stating is that this is the standard error of marking for the scale
used by this particular examiner, whatever it may be. Where marks are
unscaled, this means that there is a separate standard error of marking
(scaled) for each examiner involved.

But when the marks of all examiners are not scaled to the same
standard, a list of the standard errors of marking (scaled) of each can




... possible, for example, for ... examiner reliabilities of ... and
standard errors of marking (scaled) of 1.5 and 3.6 respectively, and yet
have one examiner pass and the other fail all of the candidates. In ...
the standard error does not reflect the errors arising from differences
between the examiners' scales.

If we consider the “correct mark” to be the mean mark of an infinite
number of ... the range of differences in scales, then the traditional
standard error, which we have called the standard error of marking
(scaled), ... of variations between obtained and correct marks.

Thus we must introduce ... standard error of marking (unscaled).
(Again, “standard error of marking (non-equivalent scales)” ... too
cumbersome.) This is ...

aA‘ , r-3»^ , =ss 

" esLoS 'ch'anse =™rs of -W « 

The difference « ‘"°g,ven the table m CtaP « « 

statistics are as follov-s 


= 222 
= 6 52 

= 267 


My - 

w * “ a f 6 ’f ’ 

•* “ 6 S3 

„ _ 19 , «•*»' 

_ difference betted 1 

vet, OS can be seen ^ Isn» ’*£Z!S*E» 

rence is larger than , ,h ■»“ sta „dar ^°\ Msak i), .« <* 
standard errors While tn of markios t 

are 191 end 2 67 the 2 62 
standard error of the ddrerencc. 

d Mean errors ‘ {scaled) cm 

. ,r,t cttS : 

sir^5STi^s55==S 
£ ££»; " rer 




(unscaled) will give us useful information. Otherwise, this ignoring of the
... differ in mean, as well as in standard deviation, we are faced with
yet another problem.

The standard error shows the extent to which a student's obtained mark
may differ from his “correct mark” or “true mark” (depending on which
standard error you are talking about). But when two examiners differ in
mean as well as in standard deviation, then the mean of each examiner's
marks also differs from the mean “correct mark”. Thus a constant must be
added to (or subtracted from) each examiner's mark to reflect this
difference. When we have a large number of examiners, their average is
taken to be the “correct mark”. Where we have only a pair of examiners,
we would seem to have no choice but to take the average of the two as the
best approximation that we have of the “correct mark”. The mean “correct
mark”, therefore, will be half-way between the means of the two
examiners, and each of them will differ from it by half of the difference
between them. In our example, the difference between the two means
is 14.1. Thus the “mean error (unscaled)” is 7.05. (The mean error
(scaled), of course, is always zero.)

We should also remember, of course, that this mean has its own
standard error, which is related to the formula for the standard error
of a difference between means. More about this later.
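
The arithmetic of the mean error (unscaled) for a pair of examiners can be sketched as follows. Only the difference of 14.1 between the two means comes from the text; the individual means below are hypothetical, chosen so that they differ by that amount, and the function name is ours.

```python
def mean_error_unscaled(mean_a: float, mean_b: float) -> float:
    """Half the difference between the two examiners' mean marks.

    With only two examiners, the mean "correct mark" is taken to lie
    half-way between their means, so each examiner's mean differs from
    it by this amount (in opposite directions).
    """
    return abs(mean_a - mean_b) / 2.0


difference_between_means = 14.1  # reported in the text

# Hypothetical individual means; only their difference is from the text.
mean_examiner_a = 30.0
mean_examiner_b = mean_examiner_a - difference_between_means

approx_correct_mean = (mean_examiner_a + mean_examiner_b) / 2.0
constant_error = mean_error_unscaled(mean_examiner_a, mean_examiner_b)

print(f"approximate mean correct mark: {approx_correct_mean:.2f}")
print(f"mean error (unscaled): plus or minus {constant_error:.2f}")
```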


7. Range of marking errors (unscaled)

Can we put these various concepts together? The constant error
in each student's mark is plus or minus the mean error (unscaled): in
our example, plus or minus 7.05. But around each end of this range of
constant error of 14.1 is a normal distribution of random errors,
described by the standard error of marking (unscaled). This leads us to
the interesting conclusion that for some students, a negative constant
error (mean error of marking) may happen to be combined with a large
positive random error. In such cases (though they would be few) the
error component in the unscaled marks may actually be less
than it would have been if the marks were scaled.

Obviously, the chances of the random component being positive
are exactly equal to the chances of its being negative. When we interpret
the standard error of marking (scaled) we say that there are two chances
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ou. of three « . 

SE„ h above or below his correct m ^ t0 ^ this There arc 

the ranee or marking errors {unsca ) partlCU lar students 

error of marking (unsealed) more t han tw0 examiners 

When (as is usually the ' e „ raU£ It may be reason 

insolscd then (he constant itself r a[kmg errors (unsealed) on 
able ,n such eases to !my pan of examiners This ts 

the maximum difference foun „[• ,h c exam nation 

the only realistic description of the haa a[ound tbe standard 

Finally the abose statemenui 1 h l ™ “ be made substituting 

error of marking Somcssha. ^" fo r the The s - 

the standard error of marks , tot ,|, t content reliabitu es 

non IS further complicated £ ,h ^ r lhcse differences are reflect 
«s»o examiners may also differ Ho»= 
the SEM (scaled) for each examiner 

« Wl™"™ 1W0 „, concepts may be summarised 

The apphcabtl t> of the 
as follows 


Means 


Std Dc« 


Concept 


Same 
D fTetent 
D fTcrcni 


tr su |S22sb— 

r-.rf--.s55-: 

India scaling means y „ non arc gc c e „ 

I the means equal standard dev, a. on *» » 

nays) larger ,han ' ff w d rection 
least a step m «"= 

Determination a id formulas Msur ement mod* J t B 

Table 7 . summanzes 

:lated cocffcients , mar Ks) are obtained ^ ^ follows — 

ixamincr coni'"' an . d ” Se ‘vanonsstandard^rorsm^^ 


Mult, Ptyln! d« r»” tab,t 
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-This as v,e have seen, is simply the product- 
Exammer reliability . this, as awarded the same 

moment correlation between the pa king twice, or by two 

pJthpr bv the same examiner marxing ♦ 

examiner reliability, though not determinable, is not really needed.
Marks reliability can be calculated without it.

Content reliability. The best way to determine content reliability
is to use two examinations that are supposed to cover the same course
(say the Board examinations of two different years). Both
examinations must then be administered to the same group of students
and the marks correlated. This is very seldom done. So far as we know,
the study of this type now being completed by the junior author (Misra,
1970) is the only one of its type for Indian examinations.

Gulliksen (1950) provides the formula for content reliability, under
the above experimental conditions. Simplifying his notation somewhat,
we may write:


$$ r_{cc} = \frac{r_{AB}}{\sqrt{r_{12}\, r_{34}}} \qquad (7.2) $$

where: $r_{cc}$ is the content reliability of the examination
$r_{AB}$ is the marks reliability of the examination
$r_{12}$ and $r_{34}$ are the two coefficients of examiner reliability.
There are, of course, four estimates of $r_{AB}$, depending on which
pair of different-form examiners one selects. In practice, therefore, $r_{AB}$
is taken as the mean of $r_{13}$, $r_{14}$, $r_{23}$, and $r_{24}$.
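
As a worked illustration, Formula 7.2 can be read as a correction for attenuation: the mean cross-form marks correlation is divided by the geometric mean of the two examiner reliabilities. The sketch below follows that reading; the function name and every coefficient in it are hypothetical, not results from the study.

```python
import math
from statistics import mean


def content_reliability(cross_form_correlations, examiner_rel_form_a,
                        examiner_rel_form_b):
    """Content reliability by correction for attenuation.

    cross_form_correlations: the four correlations between an examiner
        of one form and an examiner of the other form.
    examiner_rel_form_a, examiner_rel_form_b: the inter-examiner
        correlations within each form.
    """
    if examiner_rel_form_a <= 0 or examiner_rel_form_b <= 0:
        raise ValueError("examiner reliabilities must be positive")
    marks_reliability = mean(cross_form_correlations)
    attenuation = math.sqrt(examiner_rel_form_a * examiner_rel_form_b)
    return marks_reliability / attenuation


# Hypothetical coefficients for two parallel forms, two examiners each.
r_cross = [0.60, 0.62, 0.58, 0.60]
r_content = content_reliability(r_cross, 0.80, 0.80)
print(f"estimated content reliability: {r_content:.2f}")
```

With noisy sample coefficients a correction of this kind can exceed 1.00, so the result is best treated as an estimate rather than an exact value.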

In the present study, however, we did not have parallel examinations.
For objective-type examinations, when parallel forms are not
available (or cannot be administered), we have convenient substitutes,
either the split-half or Kuder-Richardson formulas. These, in effect,
treat each individual question as a separate examination, and then find
out if they are homogeneous enough to consider the total score reliable.
This method has been adapted to essay-type examinations through
the analysis of variance (Hoyt & Stunkard 1952, Ebel 1965. Ebel's
intra-class correlation is also discussed in Guilford, 1954, and Guilford,
1956). Unfortunately, these formulas work only when all candidates
answer the same questions. For the situation where a choice of questions
is allowed, the senior author has had to develop his own techniques
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(Harper, ,966 abstracted . App^duO » 

or Ebel’s intra-class correlatio ? 00ga gnenset of questions 

non that the examiner who allows ,a i choice _am ■*» w ^ of cqua] 
is thereby stating that he c °" s,d '' s _ £ ved by candi dates of 

difficulty, or at least that equal ml As ,v U te rece, ^ >nsmg 

equal ability We are ’’ U'ererore, j ^ We subtract error variance 

from differences in difficulty, as e dmdt by ,he vanance 

from the variance of the manes. The formulas are 

of the marks, to obtam our coefficient of mhaoint^ ^ ^ ^ 
slightly ddferent for various tK0 of four m Part 

or nine", “choose three out < > f ““ ana , It a „u known that 
Two ", etc , require slightly diff reduce re l, ability (See 

allowing a choice of questions , f n,, stu d y ) 

Harper 1962. DEPSE, 1963, ^W****^ narim g of essay type 
A reliability formula based on only a s^E ^ ^ , acloa ( 
exanunaUons cannot, of course, ,n marking. By 

values of the questions the 1 ”*'” 5 - ^ mark awarded the firet 

“halo effect" sve mean the tendency to ^1< „ However h 

question to influence the marking of , lh », there is Me 

h« content rel, ab.ht.es reported in h P P remarkable, since Indian 

or no “halo” effect m marking This -sthe^ ^ booki marking" 
examiners generally read t »« ' ques tion 1 m “ ll ans *' d u!d 

questions at once, rather thao mark.og q method 

then Question J in ah marking of individual goes 

seem to discourage completely administration 

,mnS ,„ Table 7.1 the content on parade, W- 

is called r» to distinguish ** " a ’s os there are are 

Note, also, that there are stt up, and dva0 «d 

there is a very "8“* ”” ,° g B done, for example, .nth a „ oa 
thoroughly trained ,n lts ( ot ,he CoUege i Entra" E ddfttWt 

Placement essay-type exammatious , 0 bc looking I ■ „ y 

Board of the US A), ^““'.^uaDy vanes 

types of "ideal" nnsw™*. ^ type examma 

Thus, the “contents of a fle of the subjects in® ^ ^ fof lh ; 

from examiner to examine 0 n ^nged from ", ucism s leveled 

reliabilities of the »me exa „r tbe tradU'°” a d „ proof 

ten examiners.) This, “ f 1 nation This study P 
by opponents of the essay >T ^animations, 

for their contention objective tests an . d The internal 

the parallel forms reliability notation-^" 

consistency coefficients ru 
The reader can see from Table 7.1 how the traditional formulas are
applied to the various cells.

Standard error of marks (unscaled): This is the most needed statistic
for application to examinations as they are now administered and marked
in India. But we have as yet found no way in which this can be
calculated.

Range of total error: For scaled marks, the range of total error is
completely defined by the standard error of marking (scaled) or the SEM.
When marks are unscaled, then two components enter in. One is the effect
of the distributions of errors of the two examiners, which is related to
differences in standard deviations of their individual scales. The other
is the constant effect of the differences between the means of their
individual scales.

Thus the total range of error in marking for a pair of examiners is
found by combining the mean difference between them with the standard
error of marking (unscaled) for the pair. Actually, the mean of the
correct marks is assumed to be half-way between the means of the two
examiners, so the difference between their means must be divided by 2.
This, then, gives the formula which you will find in Table 7.1, in the
second box in the row for Range of Total Error.
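
Table 7.1 is not reproduced here, so the following is only one possible reading of the verbal description above: half the difference between the two examiners' means (the constant component) is combined with the standard error of marking (unscaled) for the pair (the random component). The function name and the standard error value are hypothetical; the means are chosen to differ by the 14.1 marks of the running example.

```python
def range_of_marking_error_unscaled(mean_a, mean_b, se_marking_unscaled):
    """Return (constant_error, widest_error) for a pair of examiners.

    constant_error: half the difference between the examiners' means.
    widest_error:   constant error plus one standard error of marking
                    (unscaled), i.e. the case where the random error
                    happens to fall in the same direction as the
                    constant error.
    """
    constant_error = abs(mean_a - mean_b) / 2.0
    widest_error = constant_error + se_marking_unscaled
    return constant_error, widest_error


# Hypothetical means differing by 14.1, and an assumed standard error.
constant, widest = range_of_marking_error_unscaled(30.0, 15.9, 2.5)
print(f"constant error: plus or minus {constant:.2f}")
print(f"widest error within one standard error: plus or minus {widest:.2f}")
```

When the random error runs against the constant error the total is smaller, which is the point made in section 7 about a negative constant error combined with a positive random error.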

For the situation where there is a multiplicity of examiners, no
exact solution has yet been found. There is little difficulty, of course, if
all the examiners have marked the same batch of answer books, or
strictly parallel samples (as is done through “randomization” at Gauhati).
More often, however (as in our second study), we have only the results
of pairs of examiners on different batches of answer books. The use of
the formula for the standard error of a mean would assume that each of
these means is a random sample of a set of means. Yet the actual
differences found between the means of various examiners are far too
wide to support such an assumption.

And for the far more important problem of the range of total error
of marks (rather than of marking, as we have been discussing above), we
again have found no solution. A lower bound, of course, would be this,
based on the widest difference in means found:

± ~ & ±SEM„ h . ( 7 -«) 

But it would be preferable if we could find out the effect on the difference
between the means, of the content as well as of the examiner components
of reliability.


10. Relationship to the studies in this book

Two studies are reported in this book. One of them, involving
for the situation where parallel tests are not available, and derive their
value from the assumption that they are rough estimates of $r_{cc}$.
Empirical studies have shown that $r_{cc}$ is generally lower than $r_{kk}$, probably
because of time effects and because of a greater play of sampling error.
The junior author has recently completed the first empirical check ever
made of the relationship between Harper's $r_{kk}$ and two measures of $r_{cc}$
(Gulliksen's and Lord's), for essay-type examinations (Misra, 1970).
He found that the relationship between the two for traditional-type
examinations is approximately the same as that between Kuder-Richardson
reliability and parallel forms reliability, in objective
examinations.

Marks reliability. For the parallel examination situation, there
are four estimates of marks reliability: $r_{13}$, $r_{14}$, $r_{23}$, $r_{24}$. For each of
these, the marks of two different examiners of two different forms are
correlated. For a single representative coefficient, the mean may be
taken. In many situations, however, it may be more realistic to take the
lowest value. It is not the mean but the lowest value which tells what
damage can be done. The mean can give us a misleading sense of
confidence, like the man who started to climb the Empire State Building
because he had heard that the average height of buildings in New York
was only three hundred feet.

When we have only a single examination marked by two examiners,
and an internal consistency coefficient of content reliability, the
situation is the reverse of the above. Gulliksen's formula (Formula 7.2) is
now rewritten in the following form:

$$ r_{AA} = r_{kk}\, r_{11} \qquad (7.3) $$

It will be noted that we have substituted $r_{AA}$, the estimated marks
reliability for a single form, for $r_{AB}$, the actual correlation for parallel
forms. We have also made the transformation $\sqrt{r_{12}\, r_{34}} = r_{11}$.
Either $r_{11}$ or $r_{12}$ can be used, whichever is available, though $r_{12}$ is
preferable. There is, of course, a separate coefficient of marks reliability
for each examiner, unless the content reliabilities for both of them happen
to be identical.
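
Read in words, the rewritten formula says that the estimated marks reliability is the content reliability attenuated (multiplied) by the examiner reliability. A minimal sketch under that reading follows; the function name and coefficients are hypothetical, chosen only to show the size of the effect.

```python
def marks_reliability(content_reliability: float,
                      examiner_reliability: float) -> float:
    """Estimated marks reliability: content reliability attenuated by
    examiner reliability (both expressed as coefficients from 0 to 1)."""
    for value in (content_reliability, examiner_reliability):
        if not 0.0 <= value <= 1.0:
            raise ValueError("coefficients must lie between 0 and 1")
    return content_reliability * examiner_reliability


# Hypothetical coefficients: a reasonably homogeneous paper marked by
# examiners who agree only moderately well with one another.
r_marks = marks_reliability(content_reliability=0.75,
                            examiner_reliability=0.80)
print(f"estimated marks reliability: {r_marks:.2f}")

# For an objective test the examiner reliability is 1.00, so marks
# reliability and content reliability coincide.
r_objective = marks_reliability(0.75, 1.00)
```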

Again, unfortunately, in the most common situation (a single
examination, marked by a single examiner) we cannot calculate marks
reliability. This is because the examiner reliability figure is not available.
In a large public examination it may pay to have a representative sample
of a hundred or more answer books re-marked. They could be sent in
batches of, say, 20 each to all of the examiners involved. Re-marking
just 20 answer books would not take much time. Though individual
coefficients would not be reliable (and, in fact, need not be calculated),
a coefficient from the pooled data would be a fair estimate of examiner
reliability. (However, see discussion of Fig. 10.1.)
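
The pooled re-marking check suggested above might be computed as in the sketch below. It assumes each batch is a list of (original mark, re-mark) pairs, pools all pairs across batches, and takes a single product-moment correlation. The data layout, the function names and the marks themselves are hypothetical, and the batches are kept to five books each only for brevity.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("marks must vary for a correlation to exist")
    return sxy / math.sqrt(sxx * syy)


def pooled_examiner_reliability(batches):
    """Pool (original, re-mark) pairs from all batches and correlate."""
    originals = [first for batch in batches for first, _ in batch]
    remarks = [second for batch in batches for _, second in batch]
    return pearson_r(originals, remarks)


# Hypothetical batches of (original mark, re-mark) out of 50.
batches = [
    [(12, 15), (20, 18), (25, 27), (31, 28), (17, 17)],
    [(22, 19), (14, 16), (35, 30), (27, 29), (18, 21)],
    [(30, 33), (16, 12), (24, 24), (21, 25), (28, 26)],
]
print(f"pooled examiner reliability: {pooled_examiner_reliability(batches):.2f}")
```

Pooling raw marks mixes any differences between the examiners' individual scales into the coefficient, so the result should be interpreted with the same caution the text attaches to unscaled marks.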
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If even tins cannot be done, it may ^'“' r h ‘° e without 
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are drfferen. for each examiner ^ ttdl examiner »» J» 

SC i (scaled), because with unsealco . datd trI or in terms of «« 
o™ settle Thus we are < l «” b ^ mon mean and standard toabon- 
scale) ir marks are scaled to a cxam , nations, there 
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a s.ngle standard '™‘°“^ c a,ed) 1. is 

Standard error of marking K 0 „ bcre we have no r mot 

mg the absolute measures ; o 0 nd ^ want to 

nets. each with tus own md ^ ^ oa nev i P»»"f 
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is the difference betsseen by lbc mean. ^ ’^mttaree. 
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«i nrt concentrate o „«urac that tr ••tests / 
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standard delation » « lhemor component tsnsaers of 
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the two tests (m our' * coBeWe rero, then l 
examiners for the same batch of answer books, required some specialized
coefficients which will be discussed in the report of that study. The
second study followed the “single examination, two examiners” model,
with both self-consistency and inter-examiner coefficients. The remaining
“cells” in Table 7.1 are not directly relevant to these two studies. They
are included in the table for completeness, and also to add perspective
to the coefficients and measures used.

We cannot end this section without acknowledging, again, the
existence of a much more complete, and probably much better, set of
concepts developed by Oscar K. Buros (1963). Had these come to our
attention at the beginning, instead of after all the computer work was
finished, this entire project might have been cast in a different mould.
Any future researcher in this field should certainly be aware of Buros'
conceptualizations before finalizing his research design.

11. A final note

Several of the concepts in this chapter are still quite tentative, and
are based on insights which (though sought after for several years)
developed in the mind of the senior author only near the very end of the
project. They are included here to stimulate thinking and discussion,
even though they have neither been tested yet in practice, nor are they
reflected fully in the chapters which follow. This may be frustrating
to some readers. The alternative of not including these tentative concepts
here seemed even less desirable.
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“Ninety Marking Ten”— Detailed 
Report * 


THE BASIC outline of this study has already been given in
Chapter III and the experimental design elaborated in Chapter
VI(B).

In 1966 Taylor, Tluanga and Misra published A Study in Multiple
Marking. Since this is the only other study of its kind done in India,
we will refer to it several times in relation to our own findings. The
general similarity of the findings is the more striking when seen in terms
of the differences in experimental design between the two studies. These
differences may be summarized as follows:


                          Harper et al.            Taylor et al.

Subject                   History                  English
Level                     Higher Secondary         B.A.
Maximum marks             50                       100
Number of answer books    10                       100
Method of reproduction    facsimile                typing/cyclostyling
Number of examiners       90 from same State       19 from 9 States,
                          and Board                18 colleges


However, the parallels are limited by a striking difference, which
was probably due more to practical limitations in data-handling facilities
than to any theoretical considerations. Taylor et al. studied 19 examiners
and 100 answer books, so their analysis deals primarily with examiners;
we studied 90 examiners and 10 answer books, so our analysis deals
primarily with answer books. Thus the two studies complement each
other, and only occasionally can they be compared directly.

It is interesting to note that both the present study and Taylor et al.'s
were carried out within the same two-year span, and that the same
junior author was involved in both of them.

*Parts of this chapter have already been published (Harper 1967), but this chapter
contains several important analyses that did not find a place in the summary article.


1. Frequency distributions of marks


Table 8.1 gives the frequency distributions of the marks awarded
each of the ten candidates (A to J). These are marks out of 50 for a
History examination. In the first 10 columns the results indicate the
frequency with which each mark appeared when answer book A was marked
first, B marked second, etc. The second set of 10 columns indicates the
frequencies when the marking was in reverse order, i.e. answer book J,
then I, then H, etc. Table 3.1 (Chapter III) combined the distributions
for the two different orders into a single table.

By inspection of the two sets of columns it can be noted that the
distributions for each answer book are roughly similar for the two
orders of marking. Neither the means nor the medians differed by
more than two or three points. These differences are treated in more
detail in section 12 on “Position Effects”.

We have already commented on the fact that the sums of the two
orders of presentation (columns A-J and J-A) show nearly identical
distributions, thus establishing our confidence in the total sample.

We have already noted also that the highest mark awarded any
candidate by any examiner is 38. This is 76% marks which, in this state,
means the award of a “Distinction” to Candidate E by one examiner.
Note that this same candidate was also failed (less than 33% marks,
i.e. 16 or lower) by seven examiners.

The lowest mark awarded was 2, or 4%. One of the candidates
receiving this extremely low failing mark was also passed by two of the
90 examiners.

The widest distribution of total marks is for answer book E (see
Table 3.1), a range* of 28 marks. The narrowest distribution (Candidate B)
is one half of this, i.e. 13. This still represents a minimum difference,
for this study, of 26% marks between the highest and lowest awarded
a single candidate. Perhaps teachers and students should be most
concerned with the maximum range (56%), as this defines the risk they
take when students sit for an examination. Thus a candidate in a Board
Examination in class X History may get either a Distinction or a bad
Fail, depending purely on “luck” (i.e. who his examiner happens to be),
and regardless of his personal “merit”.

*In this paper the convention of defining the range as including the highest and lowest
mark has been followed. Thus if the highest mark is 3 and the lowest is 1, this is a
range of 3. Some writers take only the difference as the range, and would call this
a range of 2.

How often might this happen? The correct mark (see Chapter VII)
or average given by these 90 experienced examiners to this particular
answer book was 23.13 (cf. Tables 3.1 and 8.6). It is not unreasonable
to assume that, in a large examination, say 1500 “should” receive a
mark of 23. We can also assume that these candidates are roughly
similar to Candidate E, and the examiners are roughly similar to our
ninety. As many as 17 of these 1500 candidates may receive a Distinction,
while about 117 of them will fail, even though they are actually of
equal merit.
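
The figures of 17 and 117 follow from simple proportion, as the short check below shows. It uses only numbers given in the text: one examiner in ninety awarded Candidate E a Distinction, seven in ninety failed the same answer book, and 1500 candidates are assumed to deserve the same mark. The function name is ours.

```python
def expected_count(examiners_in_category: int, total_examiners: int,
                   candidates: int) -> float:
    """Expected number of candidates receiving a given outcome, if each
    candidate's examiner is drawn at random from examiners like these."""
    return candidates * examiners_in_category / total_examiners


candidates_deserving_23 = 1500
distinctions = expected_count(1, 90, candidates_deserving_23)
failures = expected_count(7, 90, candidates_deserving_23)

print(round(distinctions), round(failures))  # prints: 17 117
```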



Figure 8.1 presents the total frequency distribution graphically.
Note that this is not the distribution of marks as desired by any one
examiner. It is the combined distribution of ninety examiners, with
different means and standard deviations. Thus, it is a much wider
distribution than the distribution of any one examiner would be.
Remember, however, that distributions of public examination marks are
also widened by the same effect: the pooling together of the marks of
many examiners, each with his own individual “standard” of marking.

[Figure 8.1: total frequency distribution of marks; Total (90 examiners)]

below a particular Division, P arl ‘^ * experienced examiner, 

34%, only 26 a. 32% but again .54 «£ ^ ^ 

asked to comment on tn s t ^ a t ihc total mirks 

want to fail a student b> i one mar t0 questions 

come to 16 he usually ®?d s hen * ^student gets i total of 

to make the total 17 H • nv0 cxIr a marks to pass 

15 the examiner usually «' «* ‘f * J mrki at 15 and 17 and not 
him That is why you have got a pue up 

at 16 ’ _ t niMsion It is interesting, though that 

-me same is true for First °" 1S „ Stcond and Third Perhaps 

that there is no J-efiect at the dr s nd 2n d Third docs nor 

this is because the diiferenuationt *1 tween *» ^ do 
carry nearly the consequences ««<"., J( y m ,Vs right m the 
There ts, however, apparently sm a, ler e|ftct ,he middle 

middle of Class Tvso-an d a s " f, T S he sa m= crpenenced examiner 
of Class Three How to «>*”"£* ™ !00 , can think of .. tta 
quoted above offers this ' a boot he probably d ' cldcs 
when an examiner marls a 4hic h division the student d « 

basis of the first one or twoquesn Dmsl0n then on i an 

If he decides thattlie student d because if h= S' 

msmm 

Tables St and 3 1 £e For example, J 

near the average M' 1 .®'" a , FlI5l D.x«'°" 30 " ' L« are 
average is 27 marls, ' pools I, F, and A - “ E lh means 

a, Third Division » for answer. ^ ^ ^ f and fi ^ „ „ 

for pooled group i results, ^ j,uatlon 

alsoexhibstnJ-effem hesra i, ng orroarl»^ „ ou , d hast: the 

All this nnmenn « ■" u rc Such «=>« ■"“ “ „ other 

methods a quesUonub a P marks , bu , with * shou !J b£ b} a 
same distnbut.on as the ^ prions Sea 
than the dividing pi s h 3 ll see Diet 

normalizing procedure, ns we 
2. Classes or Divisions

The lines across Table 8.1 represent Divisions, the lowest being
the distinction between passing and failing. In many cases it is not the
marks themselves, but rather the Divisions (or Classes), that matter.
If we can treat Divisions as roughly accurate, it would be a help.
Can we? See Tables 3.2 and 8.2. Only one answer book falls definitely
in only one Division (Candidate J was failed by all examiners). Another
three were in only two Divisions. (Even the most accurate differentiating
test will, of course, put some borderline candidates in two Divisions.
But these three were not borderline.) Four candidates were placed in
three different Divisions by various examiners and two were placed in

Table 8.2 also makes it evident that the two batches of examiners
agreed quite closely in distribution of marks, and that the order in
which the answer books were marked made no striking difference in
these distributions. The large difference for I is due to the fact that many
more examiners marked it just at the passing point (17) when it was second
than when it was ninth in order of marking. Again, however, there is
no consistent tendency to do this for all answer books.

It is obvious from all these tables that, for a particular answer
book, even “Division” has very little meaning in evaluating candidates.
It should be remembered, however, that when several unreliable but
inter-correlated scores or marks are added together (and several Indian
studies have shown marks in different subjects to be correlated),
the reliability of the sum is raised. This follows from the formula for
“battery reliability”, which can be found in standard texts. Division
or Class on the aggregate may, therefore, have considerably more
meaning than Division or Class on any single paper.*
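
The text does not print the “battery reliability” formula, so the sketch below uses one standard textbook form for the reliability of an unweighted sum, assuming that errors on different papers are uncorrelated: one minus the summed error variances divided by the variance of the total. The function name and all numerical values are hypothetical; they are chosen only to show how adding correlated papers raises reliability.

```python
def composite_reliability(sds, reliabilities, correlations):
    """Reliability of an unweighted sum of several papers.

    sds:           standard deviation of each paper
    reliabilities: reliability coefficient of each paper
    correlations:  square matrix of inter-paper correlations
                   (only entries above the diagonal are used)
    """
    n = len(sds)
    total_variance = sum(sd ** 2 for sd in sds)
    for i in range(n):
        for j in range(i + 1, n):
            total_variance += 2 * correlations[i][j] * sds[i] * sds[j]
    error_variance = sum(sd ** 2 * (1 - r)
                         for sd, r in zip(sds, reliabilities))
    return 1 - error_variance / total_variance


# Hypothetical aggregate of five papers, each with SD 10 and marks
# reliability 0.60, all inter-correlated at 0.50.
papers = 5
sds = [10.0] * papers
reliabilities = [0.60] * papers
correlations = [[1.0 if i == j else 0.5 for j in range(papers)]
                for i in range(papers)]

aggregate = composite_reliability(sds, reliabilities, correlations)
print(f"single paper: 0.60, aggregate of five: {aggregate:.2f}")
```

Under these assumed figures the aggregate is noticeably more reliable than any single paper, which is the direction of the effect the paragraph describes.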

It is also interesting to note that the range of marks is wider for the
better candidates, and narrowest for those near the bottom. (This is further
confirmed by the s's in Table 8.6, and is the opposite of the phenomenon
found for rank orders.) Thus we may conclude that we can place more
reliance on the statement “He is a failure” than on the statement “He
is a First Class student,” when marks are not scaled.


*There seems to have been no study of aggregate marks reliability, as against
reliability of separate tests or papers. However, one line of evidence comes from the fact
that studies done in India have generally shown a much higher validity for predicting
aggregate marks than for predicting marks in any single subject. There are also some
useful indications given in the study of Datta (1967). (This able young scholar died
tragically in an air crash. It is hoped that the research which he completed just before
his death will keep his memory alive.) Datta correlated aggregate marks on half-yearly
and annual examinations for eighth grade in 15 schools in Tripura. The
coefficients for individual schools ranged from .80 to .93, with a median value of .88.


3. Reliability of pass/fail

Even on the crucial distinction between “passing” and “failing”,
there is very little agreement among these experienced examiners. Only
one candidate was passed by all 90 examiners, and only one was failed
by all. Eight of the ten candidates were passed by some examiners and
failed by others.* These facts suggest that for over three-quarters of
the papers in any public examination, we cannot even be sure whether
they are really of “passing” or “failing” quality. Thus, whether a particular
candidate passes or fails is as much a matter of chance (the chance
of who his examiner is) as of anything else.
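
How much is left to chance can be roughly illustrated. The sketch below assumes (this is our assumption, not a finding of the study) that the marks which different examiners give one answer book are normally distributed about its correct mark, with a standard deviation equal to the standard error of marking. The pass mark of 17 out of 50 is the one used in this examination, and 3.66 is the unscaled standard error of marking reported later in this chapter. The function name is hypothetical.

```python
import math


def probability_of_passing(correct_mark: float,
                           se_marking: float = 3.66,
                           pass_mark: float = 17.0) -> float:
    """Chance that a randomly chosen examiner awards at least pass_mark.

    Assumes examiner marks are normal about the correct mark
    (an illustrative model only).
    """
    z = (pass_mark - correct_mark) / se_marking
    # Upper-tail area of the normal curve beyond z.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


# Candidates whose correct marks lie 3 below, at, and 3 above the pass mark.
for mark in (14, 17, 20):
    print(mark, round(probability_of_passing(mark), 2))
```

Under this model a candidate whose correct mark sits exactly at the pass mark has an even chance either way, and one several marks above it is still failed by an appreciable fraction of examiners.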

Though uncertainty may be inevitable at the border, where 
a change of one mark (16 to 17, or 17 to 16) makes the difference, most 
of the affected candidates were nowhere near the border. Since failure 
in a single paper can mean failure in the examination, no matter how 
high the remaining marks, this is a serious matter indeed. The only 
solution — other than better and more reliable examination methods — 
is to apply both randomization and scaling of answer books (Taylor 
1963c), and Taylor’s (1964) grace marks techniques (see Appendix B of 
this volume), which take realistic account of examiner error. Dr. Taylor’s 
experience (mentioned in a personal communication) was that when 
these methods were applied at Gauhati, the pass percentage went up. 
It is thus obvious that a substantial proportion of the “failures” in our 
public examinations are failures of the system, not of the candidates. 

The system fails, and we blame the students! Why should any student
be labelled a “failure”, and his life and finances ruined, when his “failure”
is not due to lack of ability on his part but to lack of reliability in the
examination, especially when the simple techniques of randomization,
scaling, and a scientific (rather than arbitrary) grace marks system can
eliminate most of these false “failures”? These minimum changes
require no change in the traditional type of questions, no re-education
of examiners, no re-orientation of teaching methods, no re-education
of students. They require only a simple administrative decision about
the way in which examination answer books, and examiners’ marks,
are to be treated in the Board’s office. Is there any reason why this decision
cannot be made tomorrow?

4. Means, standard deviations, and pass percentages of examiners

Thus far we have been discussing variability of marking in terms of
the marks awarded each of the ten answer books. The source of this
variability, of course, is not the answer books themselves, but the
examiners. Taken as 90 individuals, how do the examiners vary?

*Though B and G were failed by two examiners each, only one examiner failed both
of them: a typical lack of agreement.

We have already seen in the Highlights chapter (III, Section 4, Table
3.3) that the pass percentages varied from 10% to 80%. The comparable
figures from Taylor et al. (1966) are 7% to 62%.

Perhaps more precise measures of how examiners differ are the
means and standard deviations of the marks they award. Table 8.3
gives the frequency distribution of the means of the 90 experienced
examiners. This shows how widely these examiners, all presumably
trained to examine at the same standard, differ on the average merit
of this batch of ten answer books. The range is from 8.1 to 22.2 marks
out of 50. In other words, one experienced examiner thought the average
merit of this group of candidates was only 16% of marks, while another
thought their average merit was 44% of marks. The comparable figures
in Taylor et al.'s study are 19.2% to 35.0%. The larger number of answer
books (100 instead of 10) tends to make the extremes regress toward
the mean. However, even in Taylor et al.'s study the highest mean is
nearly double the lowest.


TABLE 8.3

Frequency Distribution of the Means of the Marks Awarded by 90 Experienced
Examiners to the Same 10 History Answer Books

Means          Frequencies
22.0-22.9           1
21.0-21.9           -
20.0-20.9           1
19.0-19.9           7
18.0-18.9           9
17.0-17.9          10
16.0-16.9          15
15.0-15.9          10
14.0-14.9          12
13.0-13.9          12
12.0-12.9           4
11.0-11.9           2
10.0-10.9           4
 9.0- 9.9           2
 8.0- 8.9           1

Total              90


Range of Means = 8.1 to 22.2

[...] of averages (and not the standards of the individual examiners) that
remains relatively steady from year to year. Sampling theory would 
support this speculation. 

Examiners differ not only in mean marks, but also in the standard
deviations of their individual marking systems. (It is the complex
inter-relationship between these two which produces the differences in pass
percentage.) Since a standard deviation based on only 10 marks is of
doubtful reliability, the labour required to calculate these did not seem
justified by the potential results (since this study was done by hand, and
not by computer). Our main consideration (as will be seen later) was
the average standard deviation of the marking system for the batch of
examiners as a whole. This σ must be defined as the average σ of the
90 examiners, rather than the σ of the average marks. For this purpose,
a rough estimate seemed to be an adequate substitute for the detailed
calculations. A cue was taken from Jenkins’ method for estimating σ.
Snedecor (1940, p. 85) reports (and Guilford (1956) quotes him) that
the average range for 10 measures is 3.08 times their standard deviation.
This, then, gives us a method for estimating σ from the range. The
estimation process is given in Table 8.4 which, incidentally, also gives a
frequency distribution of ranges and estimated standard deviations for
the 90 examiners. While this method may not give a very accurate result
for any single examiner, it is reasonable to assume that the average
over 90 examiners is a fairly accurate figure. Thus we arrive at a standard
deviation of marks of 6.367.* Since these are marks out of 50, the figure
yields 12.73%. This agrees closely with other studies done in India, and
suggests that this sample of 10 examination books is a representative
sample of High School History examination books of this type.
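
The estimate can be checked in a few lines. The sketch below simply re-does the arithmetic of Table 8.4: each inclusive range R is reduced by one, weighted by the number of examiners f showing that range, and the total is divided by 3.08 N. The frequencies are copied from Table 8.4; the variable and function names are ours.

```python
# Average range of 10 measures divided by their standard deviation,
# as quoted from Snedecor in the text.
RANGE_TO_SIGMA = 3.08

# (inclusive range R, number of examiners f), copied from Table 8.4.
TABLE_8_4 = [
    (8, 1), (11, 1), (13, 2), (14, 4), (15, 5), (16, 6), (17, 5),
    (18, 4), (19, 8), (20, 8), (21, 8), (22, 7), (23, 6), (24, 8),
    (25, 4), (26, 4), (27, 3), (28, 2), (29, 2), (30, 0), (31, 1),
    (32, 1),
]


def mean_sigma_from_ranges(table, ratio=RANGE_TO_SIGMA):
    """Mean estimated sigma over all examiners: sum f(R - 1) / (ratio * N)."""
    n_examiners = sum(f for _, f in table)
    # R - 1 converts the inclusive range to highest-minus-lowest.
    weighted_total = sum(f * (r - 1) for r, f in table)
    return weighted_total / (ratio * n_examiners)


sigma = mean_sigma_from_ranges(TABLE_8_4)
print(round(sigma, 3))             # mean estimated sigma, in marks out of 50
print(round(100 * sigma / 50, 2))  # the same figure as a percentage of marks
```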

Recognizing the limitations of each individual σ_est, we still may
compare these results with those of Taylor et al. Translating their “marks
out of 100” to “marks out of 50”, the comparable range is 3.48 to 5.58
for their nineteen examiners. Their mean σ was equal to 4.54 out of 50,
as against ours of 6.367. Whether these differences represent differences
between the marking of History and of English, or whether they have
some other source, we cannot be certain.

It is clear, however, that both in terms of mean and of standard
deviation, each individual examiner seems to have his own particular
marking system or scale. Figure 8.2 examines the relationship between
range (as an estimate of standard deviation) and mean, for the “scales”
of the 90 experienced examiners. There seems to be virtually no relationship.
Thus the candidate in a large public examination with, say, one
hundred examiners, is faced with one hundred different marking scales
on the basis of which his “merit” may be reported. Not only does this [...]

*After these σ's had been estimated, a friend offered to run some of the data through
a computer. Because of computer limitations, we had to select a stratified random
sample (described later) of 49 examiners’ results. For this sample, the mean σ was
6.165, and the range 2.7 to 10.1. These exact calculations from a sample of 49 agree
very closely with our “inexact” estimates from the entire 90.

TABLE 8.4

Estimation of σ From Range for 90 Examiners, to Find Mean σ of Marking
Scale Used

Range* (R)    R - 1     f     σ_est
     8          7       1      2.3
    11         10       1      3.2
    13         12       2      3.9
    14         13       4      4.2
    15         14       5      4.6
    16         15       6      4.9
    17         16       5      5.2
    18         17       4      5.5
    19         18       8      5.8
    20         19       8      6.2
    21         20       8      6.5
    22         21       7      6.8
    23         22       6      7.1
    24         23       8      7.5
    25         24       4      7.8
    26         25       4      8.1
    27         26       3      8.4
    28         27       2      8.8
    29         28       2      9.1
    30         29       -
    31         30       1      9.7
    32         31       1     10.1

Total                  90

σ_est = estimated standard deviation

$$\text{Mean } \sigma_{est} = \frac{1}{3.08}\cdot\frac{\sum f(R-1)}{N} = \frac{1765}{(3.08)(90)} = 6.367$$

*We have followed the convention of defining the range as including the highest and
lowest mark. Snedecor has defined the range as the difference between the highest
and lowest. Therefore one has had to be subtracted from these ranges in order to
make them equivalent to the ones on which Snedecor based his table.

[...] and 11 Third Class. The examiners may think him best, but they certainly
do not agree on how good he is. 

There is somewhat less agreement as to which candidates are worst. 
But the most disagreement is about the average candidates. Candidate F, 
with an average rank order of 4.88, is given, by one examiner or another, 
almost every rank available from 1 to 10. That this is not an unusual 
result is indicated by the fact that Taylor et al. (1965 b) report that one
of their answer books was ranked 3rd by one examiner and 98th by
another.

Unfortunately, under the Indian system the decisions as to whether a 
candidate should Pass or Fail are made in the middle of the range. If 
either a much larger, or a much smaller percentage were allowed to pass, 
we could be much more certain that “Pass” and “Fail” represent descrip- 
tions of actual ability or achievement, rather than just good or bad luck 
for an average candidate. 

We started with the statement that scaling will eliminate differences
in marks awarded by different examiners, if those examiners at least
agree on “order of merit”. We see that they are by no means unanimous.
However, as already hinted at in Chapter III, the agreement on ranks is
somewhat better than the agreement on marks. Therefore the conversion
of each examiner’s marks to a common scale will at least reduce the
variability in marks. We will postpone further discussion of this to the
chapter on “Scaling”.

6. Means, medians and standard deviations * 

Table 8.6 reports the mean mark, mean rank order, median mark
and standard deviation of marks and rank orders, for each of the ten
answer books. Table 8.6(a) gives these facts for the two orders of marking
separately. Table 8.6(b) gives them for the pooled data. In Table 8.6(b)
they are arranged in order of the size of the mean, so as to bring out the
relationships among these various statistics. We will concentrate our
discussion on these pooled data. (The reader is referred to Chapter VII
for a detailed discussion of the terms used.)

Correct mark: The mean of the 90 marks awarded a particular
answer book can be considered to be, practically by definition, the
correct mark for that answer book. The definition, of course, requires
an infinite number of examiners, but 90 is a reasonable sample. This
makes the standard error 9½ (i.e. the square root of 90) times smaller
than the standard error of marking for any individual examiner. The
standard errors of the correct marks (not reported in the table) range [...]


*The organization of this particular section owes a deep debt to Taylor et al. (1966 b).

[...] book experiment are comparable. As against our standard error of
marking of 3.66, the comparable figure for Taylor et al. (1966 b) is 2.21. 
Whether these differences are due to subject matter (history vs. English) 
or to the nature of the sampling (90:10 vs. 19:100) is not known. However, 


TABLE 8.6 (a)

Ten Answer Books Marked by 90 Examiners. Means, Medians and Standard
Deviations of the Marks of 90 Experienced Examiners for Each Answer Book,
and the Means and Standard Deviations of the Rank Orders Assigned, for
Each Order of Marking.

Answer        Marks            Ranks
Book       Mean      σ         Mean
A         14.60    3.393       5.23
B         11.69    2.484       7.33
C         19.64    4.178       3.15
D         26.55    4.910       1.23
E         22.93    4.256       2.12
F         15.52    3.884       4.65
G          9.81    3.094       8.44
H         10.42    3.343       8.01
I         14.24    3.289       5.48
J          8.10    3.260       9.36

Answer        Marks            Ranks
Book       Mean      σ         Mean
J          9.38    2.780       9.10
I         16.56    3.524       4.36
H         11.40    3.613       7.52
G          9.73    2.648       8.76
F         15.21    4.068       5.04
E         23.32    5.182       2.08
D         27.35    4.250       1.18
C         18.42    4.272       3.44
B         10.83    2.652       7.99
A         14.42    3.512       5.52


TABLE 8.6 (b)

Ten Answer Books Marked by 90 Examiners. Means, Medians and Standard
Deviations of the Marks of 90 Experienced Examiners for Each Answer Book,
and the Means and Standard Deviations of the Rank Orders Assigned,
Arranged in Order of Average Marks.

Answer           Marks               Rank Order of     Rank Orders
Book       Mean      s     Median    Mean Marks        Mean     s
D         26.93    4.562    27.1
E         23.13    4.750    23.1
C         18.99    4.249    19.4
I         15.48    3.592    15.1
F         15.36    3.964    15.3
A         14.50    3.439    14.3
B         11.23    2.597    11.3
H         10.94    3.504    10.7
G          9.77    2.848     9.6
J         [...]    3.064     8.4         10             9.2    .98

For the 10 answer books: M = 15.52; Mean s = 3.66 = standard error of marking
s of Mean Marks = 5.965 = s of correct marks

For total distribution of Marks (from tables not shown here):
M = 15.32; σ = 7.187

σ of Examination (see Table 8.4 for method of estimating) = 6.367
Note: s is the estimate of population σ from the sample
























several studies have shown the content and/or marks reliability of
English examinations to be higher than, for example, traditional
mathematics examinations (see Chapter V), and just possibly the same factors
may be operating to make the marking of English examinations more
reliable than those of some other subjects.

Average scale used by examiners: There is, as we have seen
(Figure 8.2), a wide variety of scales in use by our 90 experienced
examiners, all, supposedly, working under the same [...].
There is, perhaps, some advantage in knowing the characteristics [...]
average scale used.

The calculation of the mean of this average scale was quite simple.
But to calculate and average 90 standard [...] rather large job. The [...]

[...] an extra large gap between [...] mean
marks tend to “stretch out” more at the top end [...] than do the
mean rank orders. This suggests [...] scaling
(as will be recommended in the chapter on Scaling) will quite fairly
reflect the distribution of ability in the group, although it may [...] lower the [...]

The correct marks (means) and standard errors of marking
seem to be fairly highly correlated, [...]
distribution of errors. This means that there is wider disagreement
among examiners as to what marks to award to the best students,
than to [...]

When we look at the rank orders, however, the situation is reversed.
The highest rank orders have the narrowest distributions. This suggests
that rank orders are much more reliable for the more able [...]
than [...]

A comparison of means with medians suggests that there
is relatively little skewness in the distributions, although there is a
slight tendency for the high marks to be truncated at the upper end
and the low marks at the lower end. This is as it should be for random
errors, subject to the limitations near the ceiling and floor of the [...]

8. Reliability

Any reader with training in educational measurement, or even
some familiarity with psychological tests or objective examinations,
will want to know, “What is the coefficient of reliability of [...]
examination?”

In the broadest sense, our entire discussion has been about
reliability. We have approached this in terms of variability in marks,
in means, in pass percentages, in rank orders. We have also seen
(Chapter VII) that “the reliability” is an ambiguous term, and that,
when talking of traditional essay type examinations, we must think
of at least three different types of reliability.

We have studied extensively only one aspect of reliability:
the reliability with which the examination is marked. (It should be
remembered that for objective examinations, there would be no
variation in marking at all.) Other factors which are known to lower reliability
have not been studied: variability in content (i.e. the traditional parallel
forms situation), variability in performance (i.e. test-retest), the effects
of allowing a choice of questions. (In this examination, all candidates
had selected the same five questions.) Nevertheless, we will attempt to
give some answer to the question about “the coefficient of reliability”. [...]

First order of marking (42 examiners) = .83
Second order of marking (48 examiners) = .85
Total (all 90 examiners) = .83 = examiner reliability*

A second approach was to correlate a number of actual pairs
of examiners. A 1% sample (40 pairs of examiners) was drawn by reference to
Snedecor’s table of random numbers (Snedecor, Statistical Methods
(1940) pp. 10-13). Reading across a line, chosen at random, beginning
at a spot chosen at random, pairs of digits were read off to represent
examiners. For example, the first series of digits read 803270267198.
Thus we correlated examiner 80 with 32, 70 with 26, 71 with 98. The
examiner numbers used were the original ones, which ran from 1 to 100.
So the only restriction was that if the number of an examiner who had
not returned his marks sheet was drawn, it was ignored and the next
one drawn to replace him. Thus the series 7631062905 would yield the
pairs 76-6 and 29-5, as 31 had not returned his marks sheet.
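
The drawing procedure can be sketched as follows. The function name is hypothetical, and the treatment of the digits 00 as examiner 100 is our assumption (the text does not say how examiner 100 was represented in the table of random numbers).

```python
def draw_pairs(digits, not_returned=frozenset()):
    """Read a string of random digits two at a time and pair off examiners.

    Numbers belonging to examiners who did not return a marks sheet
    are skipped, and the next number drawn takes their place.
    """
    numbers = []
    for i in range(0, len(digits) - 1, 2):
        n = int(digits[i:i + 2])
        if n == 0:
            n = 100  # assumption: "00" stands for examiner 100
        if n not in not_returned:
            numbers.append(n)
    # Consecutive surviving numbers form the pairs to be correlated.
    return [(numbers[i], numbers[i + 1]) for i in range(0, len(numbers) - 1, 2)]


print(draw_pairs("803270267198"))
print(draw_pairs("7631062905", not_returned={31}))
```

Because nothing prevents a number from being drawn again, the same examiner (or even the same pair) can appear more than once, which is the "with replacement" property discussed next.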

This is drawing “with replacement”, and resulted in several
examiners, and even one pair, appearing twice. If we had been sampling
only the 90 examiners, we would have done it “without replacement”,
i.e. we would not have allowed any examiner to be included more than
once. The present method seemed justified because we were not really
sampling 90 examiners; rather, we were attempting to draw a sample
representative of the total population of examiners, of which the 90
were themselves only a sample.

For simplicity, the correlation between each pair was again found
by rank order methods, although this may obscure some of the results
of irregular distributions that would have affected the product-moment r.
Not all the figures have been checked (although the calculations are so
simple as to make serious errors unlikely). However, all unusually low
or unusually high values were double checked.

The results are presented in Table 8.7. It is interesting to note that
the median (.86)** of this one per cent sample of rank order correlations
is very close to the value of .83 found by formula 8.2.

Remember, again, that .83 represents the reliability of marking
only if the marks of all examiners are scaled to a common mean and

*Taylor et al. had computed Kendall's Coefficient of Concordance for their data.
The result was .75. For comparative purposes we also computed it for our data,
obtaining a result of .836.
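
Kendall's coefficient of concordance W and the average rank order correlation among k examiners are linked by a standard identity, mean rho = (kW - 1) / (k - 1), which holds exactly when there are no tied ranks. The sketch below applies it to the two W values just quoted. The function name is hypothetical, and the figure of 19 examiners for Taylor et al. is the number given elsewhere in this chapter.

```python
def mean_rho_from_w(w: float, k: int) -> float:
    """Average rank order correlation among k judges implied by Kendall's W.

    Exact only when no ranks are tied.
    """
    if k < 2:
        raise ValueError("at least two judges are needed")
    return (k * w - 1.0) / (k - 1.0)


print(round(mean_rho_from_w(0.836, 90), 3))  # our 90 examiners
print(round(mean_rho_from_w(0.75, 19), 3))   # Taylor et al., 19 examiners
```

With as many as 90 examiners the two coefficients are nearly equal, which is why a W of .836 sits so close to the average inter-correlation of .83.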

**In the later computer analysis (mentioned in a previous footnote) of a stratified
random sample of 49 examiners, the median of 1176 intercorrelations of examiners
was .88. Since the product moment r can be expected to differ from rho when examiners’
marks are not normally distributed, this seems to be a very close agreement indeed.
The median r for our sample of 40 pairs in Table 8.7 was .895. Taylor et al.'s mean [...]





TABLE 8.7

Rank order correlations between 40 sample pairs of examiners, in
the order in which they were drawn

Examiners    rho
80-32        .86
70-26        .98
71-98        .89
19-18        .92
43-42        .95
90-31        .74
04-92        .96
68-95        .87
44-11        .90
54-96        .86
20-16        .93
80-92        .91
71-55        .87
16-57        .96
73-12        .90
66-63        .81
42-07        .83
02-19        .85
81-36        .85
94-19        .71
43-47        .57
36-64        .76
10-25        .72
14-55        .79
48-02        .75
14-55        .79
97-43        .83
22-83        .80
13-66        .81
34-59        .87
87-56        .86
66-76        .70
21-70        .99
02-75        .86
88-41        .92
29-52        .93
84-89        .84
51-02        .91
72-30        .93
56-46        .84

Frequency distribution of rank order correlations
among 40 pairs of examiners

rho     f
.99     1
.98     2
.97     -
.96     2
.95     1
.94     -
.93     2
.92     2
.91     2
.90     2
.89     1
.88     1
.87     3
.86     4
.85     2
.84     3
.83     1
.82     -
.81     1
.80     1
.79     2
.78     -
.77     -
.76     1
.75     1    (significant at 1%)
.74     1
.73     -
.72     1
.71     1
.70     1
.69     -
.68     -
.67     -
.66     -
.65     -
.64     -
.63     -
.62     -
.61     -
.60     -
.59     -
.58     -
.57     1    (significant at 5%)

standard deviation. Since Indian examinations are generally not so scaled
(with the notable exception of Gauhati), this apparently high figure
can be quite misleading. It is worth repeating the example of an actual
.83 correlation given in Chapter III, and adding some further statistics
to it.


              Marks by Examiner
Candidate       X        Y       Difference
A               7       23          16
B               5       15          10
C               8       27          19
D              20       34          14
E              11       29          18
F               6       26          20
G               2       16          14
H               7       17          10
I              10       22          12
J               5       13           8

Average difference                  14.1

r_XY = .83
σ_X = 4.66    M_X = 8.1
σ_Y = 6.52    M_Y = 22.2
Standard errors of marking: of X = 1.91; of Y = 2.67
Mean of actual differences between marks = 14.1

Thus if X’s marks were converted to Y’s scale, the maximum difference
between the two marks for any candidate is likely to be about 6.*
The standard error of marking for our data as a whole is found,
by formula (see Chapter VII), from the average standard deviation of
6.367 and the coefficient of examiner reliability of .83. This standard
error of marking is 2.62. This would, of course, apply to scores on the
same scale. The contrast between this and the standard error of marking
(unscaled) reported in Table 8.6 (3.65) shows a part of what is lost by
not adopting proper scaling procedures. The rest is shown by the large
differences in means in Table 8.3.
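
The figures in this example can be re-worked directly from the marks of examiners X and Y. The sketch below computes the rank order correlation by the simple sum-of-squared-differences formula (which is only approximate when marks are tied, as some of X's are), the average difference in raw marks, and a standard error of marking. We assume the Chapter VII formula to be the usual σ√(1 - r); the last function follows the footnote formula for the standard deviation of differences. All function names are ours.

```python
import math

X = [7, 5, 8, 20, 11, 6, 2, 7, 10, 5]           # candidates A to J
Y = [23, 15, 27, 34, 29, 26, 16, 17, 22, 13]


def average_ranks(values):
    """Rank from 1 (lowest mark); tied marks share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), d = difference in ranks."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def standard_error_of_marking(sigma, reliability):
    """Assumed form of the Chapter VII formula: sigma * sqrt(1 - r)."""
    return sigma * math.sqrt(1 - reliability)


def sd_of_differences(sigma_x, sigma_y, r_xy):
    """Standard deviation of the differences between two sets of marks."""
    return math.sqrt(sigma_x ** 2 + sigma_y ** 2 - 2 * r_xy * sigma_x * sigma_y)


print(round(spearman_rho(X, Y), 2))
print(sum(y - x for x, y in zip(X, Y)) / len(X))
print(round(standard_error_of_marking(6.367, 0.83), 2))
```

The two examiners agree well on order of merit and yet differ by about 14 marks out of 50 on the average candidate, which is the whole point of the example.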

Content reliability: We have already noted that examiner reliability
is misleading because many assume that it means the same thing as
objective test reliability. But the examiner reliability of an objective test,
of course, is 1.0. Thus, before comparing our reliability figure with that
of an objective examination, we must find content and marks reliability.
(See definitions in Chapter VII.)

Since we had selected ten answer books in which the candidates 
had all answered the same five questions, we were able to use Ebel’s 
intraclass correlation (Ebel, 1965, Guilford, 1954) to find the content 
reliability of the examination For this calculation we chose the work 

*The formula for the standard deviation of differences between marks is as follows:

$$\sigma_{X-Y} = \sqrt{\sigma_X^2 + \sigma_Y^2 - 2 r_{XY}\,\sigma_X \sigma_Y} \qquad (8.2)$$

[...] they were participating in this experiment, may have marked the papers
with extra care. The original examiners, marking under the usual conditions,
may have been more careless. Is this true?

Table 8.8 gives the data. The correct mark is, of course, the average 
mark of the 90 examiners. Against this standard, the original examiners* 
did about as well as the 90 re-examiners. The average of their ten marks 
was only slightly below the average of the correct marks (15.1 vs. 15.5). 
The widest error was about 4 marks. The average error of 1.9 marks is 
slightly lower than the average (2.5) of the 90 examiners. The rank order 
correlation between the correct marks and the marks awarded by these 
original examiners is +.89. 

This cannot be compared directly with our average inter-examiner
correlation of +.83. Rather, it is a correlation with the average of these
examiners. Cureton (1965) provides a formula for the average rank order
correlation of the examiners with the criterion. Translated into the
symbols we have been using, and letting rho_c stand for the average
rank order correlation of the examiners with the criterion, Cureton’s
formula reads as follows:

$$\rho_{c} = \frac{12 \sum yS}{k(N^3 - N)} - \frac{3(N+1)}{N-1} \qquad (8.3)$$

where the only new term is y, which is defined as the “criterion rank”
of each answer book, i.e. the rank order of its average mark.
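
A small check of formula 8.3 may help. The sketch below uses invented ranks from three hypothetical examiners for five answer books (not data from this study), takes S as the sum of the ranks given to each book and y as the rank of that sum, and confirms that the formula gives the same figure as averaging each examiner's rank order correlation with the criterion. It assumes no tied ranks and no tied rank sums; the function names are ours.

```python
def cureton_average_rho(rank_rows):
    """Average rank order correlation of k examiners with the criterion.

    rank_rows holds one list of ranks (1..N, no ties) per examiner.
    """
    k, n = len(rank_rows), len(rank_rows[0])
    sums = [sum(row[i] for row in rank_rows) for i in range(n)]  # S per book
    order = sorted(range(n), key=lambda i: sums[i])
    criterion = [0] * n
    for position, book in enumerate(order, start=1):
        criterion[book] = position                               # y per book
    total = sum(y * s for y, s in zip(criterion, sums))
    return 12 * total / (k * (n ** 3 - n)) - 3 * (n + 1) / (n - 1), criterion


def rho_no_ties(a, b):
    """Ordinary rank order correlation for two untied rankings."""
    n = len(a)
    return 1 - 6 * sum((p - q) ** 2 for p, q in zip(a, b)) / (n * (n * n - 1))


# Hypothetical ranks awarded by three examiners to five answer books.
ranks = [
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
]
by_formula, criterion = cureton_average_rho(ranks)
by_averaging = sum(rho_no_ties(row, criterion) for row in ranks) / len(ranks)
print(round(by_formula, 3), round(by_averaging, 3))
```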

Calculations for the 90 re-examiners give us a rho_c of +.90. The
rho_c for the original examiners, +.89, is almost exactly the same.**
Taylor et al. (1966 b) found that their original examiners correlated
+.69 with the correct marks, while the mean r (tetrachoric) of their
19 examiners with the correct marks was .86. Neither the difference
between Taylor et al.’s two coefficients, nor the difference between theirs
and ours, is significant. This does not mean, of course, that it is incorrect
to say that their original examiners correlated lower than our original
examiners or their re-examiners; it merely means that no inferences
can be drawn from these facts about the general population of “original
examiners” in history and English.

*There were six original examiners. One of them marked three of these answer books,
two marked two each, and the remaining three marked one each. On the average,
a sample taken from several examiners is likely to be closer to the mean than one
taken from a single examiner. This is due to what statisticians call “regression towards
the mean”.

**In the computer analysis of a sample of 49 of the 90 examiners (mentioned in an
earlier footnote), the product moment correlations with the correct marks ranged
from +.801 to +.994, with a median of +.947. By definition, this is the “index of
reliability” of the examiners. Test theory tells us that the coefficient of reliability is
the square of the index of reliability. (.947)² = .89, which, as we saw earlier, was
the median inter-r of the 49 examiners.

TABLE 8.8

Answer      Correct      Original examiner's      Difference
book        mark         mark
D           27.0                                    +1.0
E           23.1                                    -1.1
C           19.0                                    -1.0
I           15.5                                    -3.5
F           15.4                                    -2.4
A           14.5                                    +2.5
B           11.2                                    +3.8
H           10.9                                    -0.9
G            9.8                                    -0.8
J           [...]                                   [...]

Means       15.5         15.1                       -0.4

Average difference                                   1.9

[...] obtained from this sample suggests that it was a good sample of the
90 experienced examiners.

The first factor analysis was based on the inter-correlations of the
10 answer books. These inter-correlations ranged from + .151 (between 
answer books D and J) to +.828 (between answer books F and G), 
with a median r=.568. It is interesting to note that the lowest correlation 
was between the highest and lowest ranking answer books. This suggests 
that they differed not only in merit but in other qualities also. But the 
highest correlation was not between answer books adjacent in merit, 
but between the candidates ranked 5th and 9th. These two seem to be 
viewed in the same order of merit by all 49 examiners.

The first unrotated factor accounted for 59.01% of the variance. 
The second factor accounted for 11.80% of the variance. The third
factor accounted for 9.85% of the variance. The fourth and fifth factors 
accounted for 5.45% and 4.16% respectively, and the last four factors 
2.58%, 2.30%, 2.23%, and 1.47% respectively. 

Seven of the answer books had negative loadings on one or more
factors. 

No attempt has been made to interpret these factors. This would 
be an entire study in itself, involving subjective ratings of the answer 
books by experts on several hypothetical variables (e.g. order of merit, 
handwriting, organization, style, grammar, spelling, etc., etc.) and 
possibly a re-analysis including scores for these variables. 

What we can conclude is this : The process of marking is obviously 
a complex one, and a mark does not mean the same thing for different 
candidates. The major factor is presumably “knowledge of history”, but
a surprisingly large 41% of the variance is due to other factors. What 
leads to the raising of the mark on one answer book may be irrelevant 
in a second, and may even lead to the lowering of the mark in a third 
(hence the negative loadings). 

We hesitated to report this, because such a partial report (without 
the trappings of correlation and factor matrices, factor loadings, etc.) 
may be frustrating to the expert. (The figures are available to anyone 
who asks for them.) But we felt that partial information is better than 
none at all. The study definitely shows the value of further, more detailed 
research along these lines. 

11. Factor analysis : (2) examiners 

Correlating 49 examiners on only 10 variables is, of course, somewhat 
absurd. As such, the results should be considered to be of exploratory 
value only. This is the kind of exploration which could not possibly 
be justified with only a calculating machine available, but which the 
speed of the computer does make possible. 
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examiners do differ. How they differ is a major research subject in itself. 
12. Position effects 

The experimental design — having half the examiners mark the 
answer books in one order, and the other half mark them in the reverse 
order — was intended to bring out position effects. None were found that 
were statistically significant. Table 8.6(a) makes it evident that there was 
no significant difference between the first-order and second-order marking 
of each answer book. When we pooled data on three answer books 
(A, B, and C, marked first and marked last) the difference was still not 
significant. Nor was there any consistent tendency for the mark of a 
particular answer book to be raised or lowered by the “contrast effect” 
with the answer book which immediately preceded it. 

It is possible, of course, that the investigators did not apply the 
right hypotheses. For those with promising hypotheses, the data are 
available in the tables. However, note that Taylor et al. (1966 b) found 
position effects showing up only after about 25 of their hundred answer 
scripts had been marked. 
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"Four Thousand Re-examined ” 
— Detailed Reports 


THE NEXT five chapters will cover the research on four thousand 
answer books, the work of 44 examiners, 11 in each of four subjects. 
The highlights of this study have been given in Chapter IV, 
and will not be repeated. The data and experimental design were 
discussed in Chapter VI. 

Chapter X will compare the four examinations: history, Hindi, 
biology, and mathematics (geometry). 

Chapters XI to XIV discuss the four subjects in detail. It is assumed 
that many readers will be interested in only one particular subject. 
Therefore, at the risk of seeming repetitious, each chapter is written to be 
complete in itself. However, cross references to interesting points in 
other chapters are occasionally given. 

After reading Chapter X (comparing the four examinations), the 
reader may thus read just the one subject most interesting to him, and 
then skip to the Recommendations (beginning with Chapter XV). 

A note on the t-test 

A special note is relevant here on the interpretation of the 
“significance” stars in the subsequent tables. The t-test for the significance 
of a difference between means is supposed not to be valid if the variances 
are not homogeneous. Unfortunately, in quite a few cases the difference 
in variances is significant. However, we are dealing with correlated 
means, and there seems to be no easy way to handle this problem. Since 
our t’s have been calculated from the distribution of actual differences 
(rather than from the formula involving r), we take them to be reasonable 
estimates. A more exact approach is likely to show a larger number 
of significant differences, rather than a smaller number. 
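
The two routes to t for correlated means mentioned in this note can be sketched as follows. This is an illustration only: the marks are hypothetical, and the function names are ours, not the report's. With r computed from the same data the two routes agree; working from the actual differences simply avoids an intermediate, rounded r.

```python
import math
import statistics

def paired_t_from_differences(first, second):
    """t for correlated means, from the distribution of actual differences."""
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

def paired_t_from_r(first, second):
    """Same t via the formula involving r: var(d) = s1^2 + s2^2 - 2*r*s1*s2."""
    n = len(first)
    m1, m2 = statistics.mean(first), statistics.mean(second)
    s1, s2 = statistics.stdev(first), statistics.stdev(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second)) / (n - 1)
    r = cov / (s1 * s2)
    standard_error = math.sqrt((s1**2 + s2**2 - 2 * r * s1 * s2) / n)
    return (m1 - m2) / standard_error

# Hypothetical marks (out of 50) for six answer books, each marked twice.
original = [24, 17, 31, 16, 22, 28]
remarked = [22, 18, 27, 15, 23, 25]

t_direct = paired_t_from_differences(original, remarked)
t_formula = paired_t_from_r(original, remarked)
print(round(t_direct, 3), round(t_formula, 3))
```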
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Comparing the Four Examinations 
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the marking done for this experiment The examiners were asked to do 
their re-marking as much as possible under usual annual examination 
conditions. To do it exactly the same way, knowing that this was an 
experiment in re-marking, obviously could not have been possible. 
Still, so far as our evidence goes, there seems to be no reason to believe 
that examiners were extra careful, or otherwise “non-normal”, the 
second time. 

It should be noted that all marks are out of 50. The Hindi exam 
had only 34 marks, but all calculations were multiplied by 50/34 to 
make them comparable to the others. 


1. Mean marks 

Now look at the two lines marked Mean marks O and R. These 
are the averages (means) of 20 means, each based on 50 answer books. 
The two Arts subjects have roughly the same average marks: about 
17.5, or 35%. The two science subjects are higher and approximately 
equal: 21 marks out of 50, or 42%. These are substantial differences, 
which may or may not be related to any real differences in ability. (See 
discussion of sampling in Chapter VI.) We know of no studies directly 
comparing marks and ability levels of Arts vs. Science students. However, 
the senior author performed some comparisons in some tables given 
by Liddle (1965). He found that almost every combination of Verbal 
and Quantitative ability predicted higher Science than Arts marks. Since 
Liddle’s thesis was based on an excellent sampling procedure, this strongly 
suggests that students of equal ability will have higher marks if they 
offer Science than if they offer Arts. Thus the differences noted here are 
quite probably due at least partly to the use of different scales, and not 
to any real differences in ability between Arts and Science students. 

The line labelled Absolute differences gives the average difference 
between the 10 (or 20) averages without regard to sign. The following 
artificial example will make this clear: 


                         Same     Diff.    Total 

Mean marks   O          24.00    20.00    22.00 
             R          20.00    24.00    22.00 
(Differences)           -4.00    +4.00      nil 
Absolute differences     4.00     4.00     4.00 


If a given pair of examiners marks an answer book 
independently, there is no logical reason to call one “first” and the other 
“second”. Logically, we just have two independent markings. And if 
we decided purely at random which was “first” and which “second”, 
their average differences would end up around zero. Therefore, what 
we are interested in is the difference regardless of which is called “first” 
and which “second”. This is what “absolute differences” means. 

Let us go back a moment. We pointed out that only one of the two 
markings was under “real” conditions; the other was under, shall we say, 
“simulated real” conditions. And we can see from our Table 10.1 that 
there was some consistent difference between the two markings. Some 
of the difference in means is due to reduction of the “J-effect”, i.e. of the 
piling up of marks just at the pass mark of 33%. As can be seen from 
Table 4.5, many of these marks were re-distributed downward, thereby 
lowering the mean. However, on the average the re-examiners were also 
a little bit harder on the candidates than the original examiners were. 
Perhaps they were being more cautious? Or perhaps, knowing that these 
marks would not actually be awarded the candidates, they felt they 
could afford to be less lenient? We do not know the exact reason. 
However, we can see that the differences are small, the largest (biology) being 
only 0.83 marks out of 50.* Because of the large N’s (1,000) three of 
these four differences are statistically significant. However, it is unlikely 
that they are psychologically (or educationally) significant. (Admittedly 
this point is arguable, when a single mark in error can make the difference 
of pass or failure to literally thousands of candidates. However, these 
differences are minor compared to the other sources of error in this 
examination system.) 

We now return to our “Absolute differences”. For the Totals, 
these are largest for history (2.13) and nearly equal for the other three 
subjects. There is a rather remarkable degree of agreement in “standard 
of marking” between pairs of examiners, on the average. But the city 
of Allahabad has not been saved from flooding just because the average 
rainfall is only one inch per week. Similarly, reference to Tables 11.2, 
12.2, 13.2 and 14.2 will show that there is a wide variation among 
examiners. One pair of history examiners differed by an average of 5 
marks (10%) on 50 answer books, while one biology examiner disagreed 
with himself by as many as 4.64 marks (over 9%). 

Look at the Absolute differences in the “Same” and “Different” 
columns. In biology (partly because of the above-mentioned examiner) 
there was more agreement between different examiners (1.19) than between 
the two markings by the same examiner (1.39). In the other three subjects 
examiners agreed more with themselves than with each other as to the 
average merit of the answer books. 

Remember, of course, that these are average marks on answer 
books. Disagreements on individual marks ran as high as 20, as was 
shown in Table 4.1. 

* Because they are unimportant, these differences are not given in the table. If the 
reader desires, he can easily subtract one mean from the other to ascertain them. 


2. Standard deviations 

The average standard deviations tell roughly the same story. Perhaps 
to make up for the lowered means, and to avoid penalizing the best 
candidates, three of the four standard deviations were increased 
somewhat. Some of this increase may also be due to the same factor as for the 
mean, i.e. the reduction of the J-effect. (In Hindi, which did not increase, 
the J was at the mean.) However, since standard deviations depend on 
the square of the deviation from the mean, a shift of a point very close 
to the mean would not have very much effect. And the fact that this is 
not the only cause of the increased standard deviations is indicated by 
the fact that the percentages in upper divisions, and the percentages 
failed, also changed from Original to Re-marking. 

But the difference between “Same” and “Different” examiners is 
much more striking here than for the mean marks. The “Different” 
examiners had from two to four times as much difference between them 
as the “Same” examiners did. This suggests that, though there is a fair 
degree of agreement in the average standard, there is far less agreement 
among examiners in the awarding of very high and very low marks. 
This may be because only “pass percentages” are published, and given 
much publicity. Thus the examiner has a fairly good guide as to how 
many should pass, but not how many should get Second or First 
Divisions, or, conversely, how low the lowest should be marked. If this 
hypothesis is true, better instructions to examiners are needed to cure 
this defect. (Every American teacher we have ever met knows what 
“marking on the curve” means, and so do most American high school students. 
Thus most American marks tend to be distributed normally, with the 
proportions in each of the various “divisions” being approximately equal 
for all subjects. It is tragic that this very simplest of statistics is apparently 
unknown, or at least unpracticed, in India.) 

The standard deviations of the four subjects tell an interesting 
story — one that has already been mentioned in the Summary section. 
The standard deviation of Mathematics is twice that of any of the other 
three subjects. Mathematics answer books are marked on a different 
scale from other subjects. It is as illogical to compare the marks of mathe- 
matics students with those of other students, as it is to expect a 39-inch 
table and a 39-centimeter table to be of equal length. This has been pointed 
out repeatedly in India (Mahalanobis 1934, Bose 1955, Harper 1963, 
and others). Is it not time to apply a more rational scaling system ? 

Particularly devastating is the fact that it is the standard deviation, 
relative weight in the total aggregate marks. It is this “enlarged by error” 
standard deviation which helps determine the position of any individual 
student in the aggregate total, not the intentions of any single examiner, 
or even the average intentions of all the examiners. 

Notice also in the above table, that the largest increase in the 
distribution (σ) is in geometry, which is supposed to have the least subjective 
marking. The actual standard deviation of geometry marks is nearly 
three times that of any other subject, thus giving geometry nearly three 
times the weight in the aggregate marks that any other subject carries. 

Is it really the intention of those who control the examination system 
that a student's ability in mathematics should be three times as important 
as his ability in any other subject, in determining his final rank? 
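
The point about weight in the aggregate can be illustrated with a small sketch. The marks below are hypothetical, chosen so that both subjects have the same mean while mathematics has three times the standard deviation of history, and so that the two orders of merit are exactly opposite. The scaling shown (conversion to standard scores) is one common choice and is an assumption here, not necessarily the scaling system the report has in mind.

```python
import statistics

# Hypothetical marks out of 50 for five candidates (index 0 to 4).
history     = [23, 24, 25, 26, 27]   # standard deviation about 1.41
mathematics = [31, 28, 25, 22, 19]   # standard deviation about 4.24, three times larger

def rank_order(marks):
    """Candidate indices sorted from the highest mark to the lowest."""
    return sorted(range(len(marks)), key=lambda i: marks[i], reverse=True)

def standardize(marks):
    """Convert raw marks to standard scores (mean 0, standard deviation 1)."""
    mean, sd = statistics.mean(marks), statistics.pstdev(marks)
    return [(x - mean) / sd for x in marks]

# Raw aggregate: the subject with the larger spread decides the final rank.
raw_total = [h + m for h, m in zip(history, mathematics)]

# Scaled aggregate: each subject now carries equal weight, so the two
# opposite orders of merit cancel and every candidate ties (up to rounding).
scaled_total = [h + m for h, m in zip(standardize(history), standardize(mathematics))]

print(rank_order(raw_total) == rank_order(mathematics))
print([round(t, 6) for t in scaled_total])
```

In the raw totals the history marks have no influence at all on the final order, even though both papers are nominally out of 50.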

4. Note on use of medians for correlation coefficients 

All of the above figures have been arithmetic means. For the 
correlation coefficients which follow, medians have been used. Partly this is 
because averaging requires translating into z first, which is very time 
consuming when 360 coefficients are involved. Partly it is because 
coefficients based on 50 cases are not highly reliable, and thus “chance” 
variations can be quite large. Use of the median ensures that any “chance” 
variations at the extremes will not unduly influence the average. 
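
The two ways of summarising a set of correlation coefficients described in this note can be compared in a short sketch. The coefficients are hypothetical; the last one stands in for a "chance" low value. Averaging through Fisher's z uses the standard transformation z = atanh(r), which is what "translating into z" is taken to mean here.

```python
import math
import statistics

def mean_r_via_fisher_z(coefficients):
    """Average correlations by converting to Fisher's z, averaging, converting back."""
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(statistics.mean(z_values))

# Hypothetical coefficients, each imagined as based on 50 answer books.
rs = [0.81, 0.84, 0.85, 0.86, 0.40]

median_r = statistics.median(rs)          # unaffected by the one extreme value
fisher_mean_r = mean_r_via_fisher_z(rs)   # pulled down by the extreme value
plain_mean_r = statistics.mean(rs)        # pulled down even further

print(median_r, round(fisher_mean_r, 3), round(plain_mean_r, 3))
```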

5. Correlation of no. of pages vs. marks 

Does the number of pages written influence the number of marks 
awarded? The evidence that, in general, it does not is discussed more 
fully in a later section (see Table 10.4). It can be seen here that the 
medians are quite low: no higher than they should be on the reasonable 
hypothesis that, after all, the worst student knows less than the best, 
and can be expected to write fewer pages. 

Note that there seems no particular logical reason to expect the 
correlations for “Original” marking to be different from those for 
“Re-marking”. And the medians obtained support this. 

6. Mean of absolute differences 

This is put way down here to try to discourage anyone from confusing 
it with the “Absolute differences of means”, given above. 

The absolute differences in the section on means were the absolute 
differences between means or averages. 

The absolute differences here are the absolute differences between 
individual marks. In other words, we have added up the differences 
between the first and second marking of each individual answer book, 
and averaged these. They are “absolute differences” because they are 
without regard to sign. As we have pointed out, there is really no logical 
reason to consider one marking “first” and the other “second”, so a 
difference is a difference, regardless of its direction. The following example 
will make this calculation clear: 


Candidate    First Marking    Second Marking    Difference 

a                 43                39              -4 
b                 37                35              -2 
c                 40                42              +2 
d                 25                29              +4 

Total            145               145              12 
Mean           36.25             36.25               3 

Note that in this example there is no difference between the means 
(36.25), but the mean of absolute differences is 3. This shows the difference 
between the two concepts. 
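
The same worked example can be checked in a few lines. The marks are those of the four candidates in the example above (the fourth candidate is labelled "d" here); the variable names are ours.

```python
# Marks of the four candidates in the worked example.
first_marking  = {"a": 43, "b": 37, "c": 40, "d": 25}
second_marking = {"a": 39, "b": 35, "c": 42, "d": 29}

n = len(first_marking)
mean_first = sum(first_marking.values()) / n
mean_second = sum(second_marking.values()) / n

# Concept 1: absolute difference between the two means.
difference_of_means = abs(mean_first - mean_second)

# Concept 2: mean of the absolute differences between individual marks.
mean_abs_difference = sum(
    abs(second_marking[c] - first_marking[c]) for c in first_marking
) / n

print(mean_first, mean_second)    # 36.25 36.25
print(difference_of_means)        # 0.0
print(mean_abs_difference)        # 3.0
```

The signed differences (-4, -2, +2, +4) cancel exactly, which is why the two means agree even though no single answer book received the same mark twice.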

Look at the four “total” columns. The largest difference between 
examiners is in history (an average of 3.05 marks out of 50, or 6%) 
and the smallest in mathematics (2.58, or 5%). Notice, however, that 
there is really very little difference in the accuracy of history and 
mathematics marks, in spite of the popular superstition that mathematics 
marks are far more accurate than history marks. Of course, if history and 
mathematics were marked on the same scale, then our other evidence 
shows that the mean of absolute differences would be much less for 
mathematics than for history. But as the marks stand, mathematics 
cannot boast any higher stability for its marks than history. 

Now look at the “Same” and “Different” columns. In history and 
in mathematics, examiners re-marking their own answer books do 
substantially better (2% of marks) than when they are re-marking another's 
answer books. For Hindi and biology, examiners do not seem to be 
agreeing much better with themselves than they do with each other, 
so far as this measure is concerned. 

Incidentally, do not take comfort in the fact that these differences 
look rather small. Remember, these are just averages. The average 
temperature of Delhi may be 75°, but this won't save you from the 
possibility of heat stroke in June. Table 4.1 shows that some of the 
differences contributing to these averages are as large as 40 per cent of 
marks. 


7. Examiner reliability 

Examiner reliability is the product-moment correlation between 
two examiners marking the same set of answer books. It is the measure 
of agreement between examiners. It is also known as “reader reliability”. 

We found a range of examiner reliabilities (both self-consistency 
and inter-examiner consistency), of course. Perhaps the surprising thing 
is that the range was not larger. Here are the figures for the range of 
correlations, when the same and when a different examiner re-marked 
each batch of 50 answer books: 



                 Same           Different 

History       .72 to .93       .49 to .85 
Hindi         .61 to .94       .66 to .88 
Biology       .75 to .91       .67 to .88 
Mathematics   .81 to .99       .92 to .97 


We noted in the last section that Hindi and biology examiners 
do not agree much better with themselves than they agree with each 
other, while for history and mathematics examiners there is a 
substantial difference. Exactly the same picture is shown by the examiner 
reliability coefficients. But to see this, we must translate the coefficients 
into Fisher’s z, because a difference of .03 between .95 and .98 is actually 
50% larger than a difference of .11 between .74 and .85. 

Subject          History         Hindi          Biology        Mathematics 

r or z           r      z        r      z       r      z       r      z 

Same            .85   1.256     .81   1.127    .83   1.188    .98   2.298 

Different       .74    .950     .78   1.045    .80   1.099    .95   1.832 

Difference (z)         .306            .082           .089           .466 

Thus we see that mathematicians agree more highly with themselves 
than with each other in marking papers, and so do historians.* But for 
Hindi and biology the difference between “self” and “other” agreements 
(i.e. between the self-consistency and inter-examiner consistency types 
of reliability) is negligible. 
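
The translation into Fisher's z used in the table above can be reproduced as a sketch, using the standard transformation z = atanh(r) and the median coefficients quoted for each subject. Results may differ from the printed table in the third decimal place, because the table subtracts z values that were already rounded.

```python
import math

# Median examiner reliabilities quoted above: (Same, Different).
median_r = {
    "History":     (0.85, 0.74),
    "Hindi":       (0.81, 0.78),
    "Biology":     (0.83, 0.80),
    "Mathematics": (0.98, 0.95),
}

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)

z_difference = {
    subject: fisher_z(same) - fisher_z(different)
    for subject, (same, different) in median_r.items()
}

for subject, diff in z_difference.items():
    print(f"{subject:<12} {diff:.3f}")
```

On the z scale the gap of .03 in mathematics is about one and a half times the gap of .11 in history, which is the comparison the text relies on.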

Why these differences? The real reasons could be known only with 
further research; however, it may be of some value to speculate. We 
can rule out memory as a factor, because of (a) the substantial number 
of answer books and (b) the time interval of nearly nine months. A 
more reasonable hypothesis is in “styles of marking”. The data seem 
to suggest (see next section, also Chapter VIII) that different examiners 
are looking for different things, and that these differences are consistent 
traits for examiners. For example, one historian may be more impressed 
with accurate dates; another may overlook an error in date if the 
candidate is approximately correct and shows a real understanding of the era 
and events involved. Mathematicians may give more or less credit for 
certain aspects of partially correct answers. The data suggest that perhaps 
there is wider agreement in Hindi and biology about what should and 
should not be credited than in the other two subjects. Better instructions 
to examiners can reduce these differences. 

* The statistical significance of these two differences is beyond the 1% level of 
confidence. However, the educational significance of the differences is probably negligible. 
The number of students whose class or divisions would be different for Same and for 
Other re-marking is small. 

The “Total” columns tell a different sort of story. Though 
mathematicians may agree better with themselves than with others, they do 
agree with each other (.95) much better than other subject-examiners 
agree even with themselves (.85, .81, .83). Thus examiner reliability for 
history, Hindi, and biology seems to be about equal, but for 
mathematics it is definitely better. Of course, even for mathematics it is far 
below the “examiner reliability” of objective examinations, as the latter 
is 1.00. In terms of z, the difference between reliabilities of .96 and .9965 
(which is the closest to 1.00 that the table comes) is larger than the 
difference between reliabilities of 0 and .80. 

These examiner reliabilities are in line with those found in Western 
studies, except, of course, for those studies which are based on special 
and intensive training of examiners, and greatly improved marking 
systems. There is a general belief that Indian examiners spend less time 
per paper than in the West (which may be true) and that they are therefore 
more careless and less reliable. The latter, apparently, is not true. Indian 
examiners are apparently no less reliable than Western examiners. But 
traditional examinations in themselves are unreliable. 

Thus we cannot take much comfort in the reader reliabilities which 
we have found for history, Hindi and biology. Gulliksen (1950, p. 212) 
remarks: “In general we should strive for a reader reliability over .90 
if it is possible to achieve this level. It would seem that a reader reliability 
of less than .80 is so low as to necessitate further discussion and alteration 
in methods of reading.” Only one-quarter of our inter-examiner 
consistency figures were above .90 (and those mostly in mathematics), while 
nearly one half (19 out of 40) were below the .80 minimum. 


8. Content reliability 

Not only are these examiner reliabilities inadequate; they also can 
be misleading. As we noted in the last chapter, coefficients of correlation 
take no account of differences in mean and standard deviation. Thus 
one examiner could fail an entire batch of candidates and the other pass 
all of them with high marks, yet their examiner reliability could be as 
high as 1.00, if they both put the answer books in the same order of merit. 

This factor could, of course, be taken care of by scaling the 
examination marks, as is recommended several times in this book. A more serious 
problem, therefore, is the fact that all too often examiner reliabilities of 
essay-type examinations are talked about as though they were the same 
as objective-examination reliabilities. It has often been shown that 
examiner reliability can be substantially improved by more carefully defined 
marking standards, better techniques, better training of examiners, etc. 
(See Basumallik, 1959, for an excellent review.) The assumption often 
seems to be, perhaps unconsciously, that “If we can raise examiner 
reliability to .90, the essay examination will be as reliable as an objective 
examination.” Nothing could be further from the truth. Though an essay 
examination cannot be reliable without high examiner reliability, high 
examiner reliability in itself does not make an essay examination reliable. 
Remember that objective-type examinations have an examiner reliability 
of 1.00. Even an utterly unreliable objective-type examination (we once 
actually saw one whose odd-even reliability was 0*) still has an examiner 
reliability of 1.00. 
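
The first point in this section, that a correlation coefficient takes no account of differences in mean and standard deviation, can be shown with a minimal sketch. The two sets of marks are hypothetical, and the pass mark of 16.5 is simply 33% of 50.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of marks."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

PASS_MARK = 16.5  # 33% of 50

# Hypothetical marks given to the same six answer books by two examiners.
strict_examiner  = [4, 6, 8, 10, 12, 14]     # fails the entire batch
lenient_examiner = [28, 32, 36, 40, 44, 48]  # passes all with high marks

examiner_reliability = pearson_r(strict_examiner, lenient_examiner)

print(round(examiner_reliability, 6))
print(all(m < PASS_MARK for m in strict_examiner))
print(all(m >= PASS_MARK for m in lenient_examiner))
```

Both examiners place the six answer books in the same order of merit, so the coefficient is as high as it can be, although no candidate receives the same result from both.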

The coefficient for an essay examination, which must be compared 
with objective examination reliability, is what Gulliksen (1950) calls 
“content reliability”. The reliability coefficient for objective examinations 
is based on the contents of the examination, not on agreement between 
examiners. The same must be required of essay-type examinations. 

The various ways of obtaining the coefficient of content reliability 
have already been discussed in Chapter VII. An abstract of Harper’s 
adaptation of Ebel’s intra-class correlation is given in the Appendix. 
It was also noted, in Chapter VII, that this method may produce spuriously 
high content reliabilities, due to the “halo effect”, an effect which is 
not discouraged by the marking methods generally prevalent in India 
(i.e. reading through each entire answer book, rather than marking 
Question 1 in all answer books, then Question 2 in all, etc.). Thus it 
seems reasonable to consider our obtained content reliabilities as upper 
limits, under current conditions. 

The fact that a wide range of reliabilities was obtained for each 
examination, though the question paper was exactly the same in every 
case, indicates that what is being marked, how much relative weight 
is given to various elements, etc , must differ from examiner to examiner. 
The ranges and median content reliabilities were as follows: 


Subject          Range         Median 

History        .21 to .76       .48 
Hindi          .45 to .77       .70 
Biology        .41 to .68       .63 
Mathematics    .71 to .90       .77 

It is instructive to compare these history reliabilities with the one 
reported for the same examination in the “ninety marking ten” experiment. 
A choice of five questions out of ten was allowed the candidates, but 
for that experiment answer books were selected in which the same five 
questions had been answered. The content reliability for a batch of 
answer books in which all candidates had answered the same five 
questions was .85 (Chapter VIII, section 8). When different sets of five out 
of ten were answered, the maximum content reliability drops to .76 
and the median to .48. This supports the recommendation of all experts 
(DEPSE 1963) that all questions should be compulsory. 

We have not found any other content reliabilities with which to 
compare our history, Hindi and biology results. Gayen (1961) found 
the reliability of a similar mathematics examination to be .63. Considering 
the difference in techniques, these coefficients are reasonably close. 
Note that for mathematics content reliability is considerably lower 
than examiner reliability, so high reader reliability is not much use 
when the examination method itself is defective. This points up the fact 
that the main advantage of so-called objective tests is not their objectivity 
in marking. (That exists in mathematics also.) The major strength of 
objective examinations is in their much more adequate sampling of the 
candidate's knowledge and understanding of his subject. It is here that the 
traditional type of mathematics examination, with its very limited 
number of questions, falls down completely. The elimination of choice 
and options, a substantial increase in the number of questions in 
mathematics, and the wider use of short answer (or an improved form of 
“short notes”) questions in the other subjects could raise these content 
reliabilities significantly. 

We have no other reported reliabilities with which to compare 
our results. It is interesting that so crucial a question as the reliability 
of examinations where a choice of questions is allowed has apparently 
never been tackled before. Even Gayen (personal communication, 1966) 
side-stepped the choice problem by simply lumping together all options 
into a single score for the group to which they belonged. The present writer 
considered this question so crucial that he held up this entire research 
project for several months, searching for an answer. 

One last comment on content reliability: No objective examination 
with reliabilities as low as these would be considered acceptable for 
determining examination marks. 

9. Marks reliability 

If we are to compare essay with objective-type examinations, it is 
the reliability of the actual marks themselves which is most important. 
For objective-type examinations, this “marks reliability” (also called 
“score reliability”) is identical with “content reliability”, which is why 
we have only a single term or name, “reliability”. But for traditional 
examinations, the content reliability is reduced or attenuated by the fact 
that two examiners will not give the same marks. Thus, marks reliability 
is the product of examiner reliability and content reliability. (This formula 
is derived from Gulliksen, 1950, pp. 211-214. See Chapter VII.) Thus, 
if examiner reliability is .90 and content reliability is .80, marks reliability 
will be (.90)(.80) = .72. 

The reason marks reliability is given in the “Different” column in 
Table 10.1 is this: What we are interested in is the reliability when two 
different examiners are marking the test, since very few students have 
the same examiner, and most of them have different examiners. 
Therefore, for each pair of examiners, we multiplied the “Diff.” examiner 
reliability (i.e. the inter-examiner consistency) with the content reliability 
to obtain marks reliability. The figures given in Table 10.1 are the medians 
of these 40 calculations.* Figure 4.5 plots the individual coefficients 
from which these medians were derived. 
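
The calculation described in this section can be sketched briefly: marks reliability is taken as the product of examiner reliability and content reliability for each pair of examiners, and the median of those products is reported. The .90 and .80 figures come from the example in the text; the three pairs of coefficients are hypothetical and are not taken from Table 10.1 or Table D.

```python
import statistics

def marks_reliability(examiner_reliability, content_reliability):
    """Marks reliability as the product of examiner and content reliability."""
    return examiner_reliability * content_reliability

# The example given in the text.
example = marks_reliability(0.90, 0.80)
print(round(example, 2))  # 0.72

# Hypothetical (inter-examiner consistency, content reliability) values
# for three pairs of examiners in one subject.
pairs = [(0.74, 0.48), (0.70, 0.55), (0.82, 0.40)]

products = [marks_reliability(e, c) for e, c in pairs]
median_marks_reliability = statistics.median(products)
print(round(median_marks_reliability, 3))
```

Because the median is taken over the individual products, it need not equal the product of the median examiner reliability and the median content reliability.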

None of these reliabilities would be considered adequate for an 
objective examination. Not one of these examinations is adequately reliable 
for the use that is made of the results. 

10. Comparison with other studies 

A question has been raised about the relationship between the 
“marks reliabilities” found in this study, and the intercorrelations of 
examinations in other studies. For example, the Maharashtra Board 
reports correlations of .53 and .52 between the Preliminary Examination 
(held by the schools) and the S.S.C. for History. Does not such a result 
call into question our reported reliabilities of .28 to .43 for History? 

It may be mentioned that most such correlations reported in the 
literature are quite low. Still, the problem of these occasional higher 
ones must be dealt with. Similarly, some studies have shown predictive 
validities for objective tests predicting essay exam marks, to be higher 
than the essay-exam reliabilities that we have reported. Can validity be 
higher than reliability? 

* Table D in the Appendix gives all the details which entered into the calculations: 
Examiner, Content and Marks Reliability. 

The last problem is the easiest to deal with. When two examinations 
are correlated, the upper limit of the coefficient is equal to the square 
root of the product of the two reliabilities. (Cf. the formula for the 
correction for attenuation.) Thus, as the reliability of one of these variables 
(say, the objective examination) approaches unity, the validity coefficient 
approaches the square root of the lower reliability. Thus an examination 
whose reliability is only .28 can have a correlation approaching .53 with 
a highly reliable examination. 
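
The upper limit stated in this paragraph can be worked out directly. The function name is ours; the reliabilities of .28 and 1.00 are the ones used in the text, and .43 is the upper end of the range reported for History.

```python
import math

def max_correlation(reliability_x, reliability_y):
    """Upper limit of the correlation between two measures:
    the square root of the product of their reliabilities."""
    return math.sqrt(reliability_x * reliability_y)

# An examination of reliability .28 against a perfectly reliable one.
limit_low = max_correlation(0.28, 1.00)

# The same calculation for the highest History reliability reported (.43).
limit_high = max_correlation(0.43, 1.00)

print(round(limit_low, 3), round(limit_high, 3))
```

So a correlation of .53 with another examination is not, by itself, inconsistent with a reliability as low as .28, provided the other examination is highly reliable.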

Secondly, most of the published studies deal with subject marks, 
while ours deals with marks on only a single paper. The reliability of 
a mark which includes two papers will be higher than that of either 
paper, (except in the unlikely event that the correlation between the 
two papers is zero). 

Thirdly, results are strictly comparable only when the work of a
single examiner (or of a single pair of examiners) is examined. The effect
of range on correlation coefficients is well known. On the average, the
standard deviation of the marks of any single examiner will be less than
the standard deviation for the examination as a whole, because the
latter is made up of marks from many examiners, whose standards
differ. Thus the pooling of results of two or more examiners may
artificially either raise or lower the total correlation. An example of the
former case is given in Figure 10.1. Note that Examiners A and B are stiff
markers for their batch of papers (or else they have drawn a poor sample of
candidates), while Examiners C and D are lenient markers for their batch
(or else have drawn only good candidates). The correlation between each
pair of examiners is low, but when the data are pooled the total
correlation for the scatter chart becomes quite high. Similarly,
if the results of two examiners, both with low reliability, are pooled, it
is quite possible that the correlation with the criterion may be
considerably higher than the two reliabilities.

Fig. 10.1
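
The pooling effect can be reproduced with a few invented marks. In this sketch (the data are made up for illustration and are not taken from the study) each pair of examiners agrees no better than chance within its own batch, yet the pooled correlation is high simply because one batch was marked low and the other high.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length lists of marks."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented marks out of 50: first marking and re-marking of the same books.
stiff_first, stiff_second = [10, 12, 14, 16], [14, 10, 16, 12]      # pair A and B
lenient_first, lenient_second = [30, 32, 34, 36], [34, 30, 36, 32]  # pair C and D

r_stiff = pearson_r(stiff_first, stiff_second)
r_lenient = pearson_r(lenient_first, lenient_second)
r_pooled = pearson_r(stiff_first + lenient_first, stiff_second + lenient_second)
print(r_stiff, r_lenient, round(r_pooled, 2))
```

The within-pair correlations are zero by construction, while the pooled figure is driven almost entirely by the difference in level between the two batches.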

In such cases, the standard errors of measurement and standard 
errors of estimate would probably be a much more stable basis for 
comparison than the correlation coefficients, which are subject to so 
many irrelevant influences. 

Finally, let it again be pointed out that Harper’s reliability formula 
for essay examinations is new. It has not yet been subjected to “the 
test of time”, and in fact has been subjected by only one researcher to 
an empirical confirmation of its validity. It is not impossible that the 
formula may be less accurate than logic, at present, leads us to believe. 


11. Standard error of measurement

Neither the student nor the teacher is interested in reliability co- 
efficients. What the student wants to know is, “If I took the examination 
again, and it was marked by a different examiner, would I get the same 
mark?” This is what the last line in Table 10. 1 attempts to answer. 

Reliability coefficients are affected by some factors that have nothing 
to do with reliability. This is because they are correlation coefficients. 
Particularly, they are affected by the range of talent in the group taking 
the test. It is quite possible to construct a test which will have a reliability 
of .95 when administered to children ranging in age from two to sixteen— 
and yet have a reliability of .40 or less for any single age group. The 
standard error of measurement, on the other hand, will remain relatively 
constant regardless of the range of talent examined. Thus, the standard 
error of measurement is generally a far more important and realistic 
indicator of the true reliability of a test in actual practice than is the
coefficient of reliability. (Harper 1959, Varma 1962, and standard statistical
texts.)
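
The contrast between the two indices can be sketched with the usual relation SEM = SD × √(1 − r). The figures below are illustrative assumptions chosen to echo the .95 and .40 mentioned above; they are not taken from any table in this report.

```python
import math


def reliability_from_sem(sd: float, sem: float) -> float:
    """Reliability implied by a group's SD and a fixed SEM: r = 1 - SEM^2 / SD^2."""
    return 1.0 - (sem / sd) ** 2


def sem_from_reliability(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


SEM = 5.0                                   # assumed constant error, in marks
for label, sd in (("wide range of talent", 22.4), ("single age group", 6.5)):
    r = reliability_from_sem(sd, SEM)
    print(label, round(r, 2), round(sem_from_reliability(sd, r), 2))
```

The same error of 5 marks corresponds to a reliability of about .95 in the wide group and about .41 in the narrow one, so the coefficient changes with the group while the SEM does not.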

The standard error of measurement (SEM) is the standard deviation
of error scores around the “true score” for any test. If we administered
an infinite number of essay examinations to a single student, and had 
them marked by an infinite number of examiners, we might consider 
the average of all these marks the “true mark” for that student. And 
the standard deviation of the infinite number of marks, around this 
true mark, would be the SEM of the examination. 

More practically, the SEM enables us to make statements like this:

“In mathematics (see Table 10.1) there are two chances out of three that a
candidate’s mark will be 5 or less points away from his true mark. There
is one chance out of six that it will be more than 5 marks too high; and
one chance out of six that it will be more than 5 marks too low.”

Again remember that these are average or mean SEM’s. As can be [...]
largest SEM (in spite of having the [...]) [...] a different scale from other
examinations, [...] a much [...] standard deviation. If mathematics were
scaled [...], its SEM would be the smallest of the four. This shows clearly
why [the SEM] is a better indication of the reliability of actual marks than
[...] reliability.

Another fact must be pointed out. These SEM’s are actually minimum
estimates. They are based on the [...] that both examiners of each batch of
papers [...] and standard deviation. This we know is not true, [and] the
differences in M and σ would inevitably make the standard deviation of
errors much larger than these SEM’s in actual practice.
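
The “two chances out of three” statement quoted in this section follows from treating errors of measurement as normally distributed around the true mark, an assumption the report makes only implicitly. A minimal sketch of the arithmetic, using nothing but the standard normal curve:

```python
import math


def prob_within(k_sem: float) -> float:
    """Probability that a normally distributed error lies within +/- k SEMs."""
    return math.erf(k_sem / math.sqrt(2.0))


within_one = prob_within(1.0)             # mark within one SEM of the true mark
too_high = (1.0 - within_one) / 2.0       # more than one SEM too high
too_low = too_high                        # more than one SEM too low, by symmetry
print(round(within_one, 3), round(too_high, 3), round(too_low, 3))
```

The exact figures are roughly 68% and 16%, which the text rounds to two chances in three and one chance in six.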

12 Scatter charts for same and [different] examiners

Table 10.2 shows, for each of the four subjects, [how far examiners]
differ from themselves and [from one another when] answer books are marked
again. [In each] subject 500 answer books were re-examined by the original
[examiner] and 500 were re-examined by a second examiner. (These data were
[presented] in Table 4.3.)

The interested reader may want [to make] the same type of detailed
analysis of Table 10.2 as was done of Table 4.3 (see Chapter IV, section 4).
But perhaps the most striking thing [about Table] 10.2 is just the
demonstration of the fact that [examiners] re-examining answer
books several months later disagree with their own first markings almost
as much as they disagree with [each other].

Also of importance is the count [of changes] both from pass to [fail]
and from fail to pass. The [reader may count] these for himself, to
bring home to himself the extent [of the] human tragedy which can be caused
by the unreliability of the traditional type [of] examination.

13 Six examiners compared [...] of books

There is a possibility [that the reader may] think that some of
the results shown thus far [are due to the] fact that [each examiner]
receives answer books from a [different] batch of candidates, [whose] ability,
we know, may differ from [that of] another [batch] both [in average]
and range. To counter this possible source of error, and [to make]
the study more precise, the following procedure [was adopted as] a special
part of the total study. [...] 300 answer books of one
specially experienced examiner (Examiner X) [...]. These
300 answer books were divided [into six] sets of 50 each, which were as
nearly identical as possible in [...]

[...] in this project, they [...]

[...] means and differences in marks, while the former concentrates [on the]
ranges of marks awarded equivalent answer books by
different examiners. For a more stable figure [than the range,]
the standard deviation was used. Thus each bar actually [represents] a little
less than half the range.* Figure 4.2 was arranged in order of [standard]
deviation, while Table 10.3 is arranged in order of mean.

In Table 10.3 you will find the result for Examiner X (the specially
experienced examiner who originally marked all 300 answer books)
listed first in each section. You will then find the results of the six
examiners, listed in order of average marks. Means and standard
deviations are given. The third column shows the average difference between
the two marks awarded, one by Examiner X and the other by one of the
six re-examiners.† And the last column shows the single largest difference.

The reader should not take much comfort in the fact that the “average
difference” (mean of differences) is so much smaller than the “largest
difference”. Averages are misleading. We must not forget the man who
was being tortured by the Gestapo. They had his head in an oven and
his feet on a cake of ice. When asked how he felt, he replied, “On the
average, I feel fine.” One psychologist, in fact, strongly advised us that
we should not report the average differences at all. It is the largest
differences which, after all, are far more important. Remember that each
“largest difference” is the change in the marks of one answer book out
of 50. This large difference represents, then, the minimum difference in
marks for two per cent of all answer books for these two examiners.
Considering that an examiner may mark from 400 to 500 answer books,
this error of [...] marks represents the minimum error for eight to ten of
his candidates. If your son has failed instead of being awarded a Second
Class (which is what an error of even 7 marks out of 50 can mean), it
will not make you any happier to be assured that the average error is
only 2 or 3 marks!

*Each bar included ±1σ from the mean. For N=50, range/σ averages 4.5 (Guilford,
1956, page 43). Therefore each bar was about 44% of the range, though it included
68% of the cases.

†Note that this “mean of the differences” is not the same as the difference between
means. As was explained in more detail earlier (see “Mean of absolute differences”),
this was obtained by adding up all differences without regard to sign. For example,
a difference of 3 was counted as “+3” regardless of whether the second mark was
higher or lower than the first. But the differences where the second mark is higher
and those where the second is lower tend to cancel each other out to some extent,
making the “difference between the means” lower than the “mean of the
differences”.
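
The distinction between the “mean of the differences” and the “difference between means” is easy to see on a handful of invented marks. The sketch below uses made-up figures for five answer books; the function names are ours.

```python
def mean_of_differences(first, second):
    """Mean of absolute differences between paired marks (sign ignored)."""
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)


def difference_between_means(first, second):
    """Difference between the two batch means (signed differences cancel)."""
    return sum(second) / len(second) - sum(first) / len(first)


# Invented marks for five answer books, first marking and re-marking.
first_marks = [18, 22, 15, 30, 25]
second_marks = [21, 19, 17, 26, 28]

largest = max(abs(a - b) for a, b in zip(first_marks, second_marks))
print(mean_of_differences(first_marks, second_marks))
print(round(difference_between_means(first_marks, second_marks), 2))
print(largest)
```

Here the two examiners differ by 3 marks per book on the average and by 4 marks at worst, although their batch means differ by only a fifth of a mark.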


The last line of each section in Table 10.3 gives the range from
highest to lowest value in each column, i.e. the difference between the
highest and lowest means and SD’s. These figures are good indicators
of the amount of subjective judgement that goes into fixing the “standard”
in any subject, or, rather, the extent to which examiners’ subjective
judgements of “standard” differ.

History shows the largest variation in averages (Table 10.3 gives
it as 7.44, and you can see how they jump around in Figure 4.2), with
Hindi and biology next. However, biology shows by far the largest
variation in standard deviations (3.577). Comparing these with Hindi
and geometry suggests that the variable standards in history and biology
are due to the greater amount of judgement involved in marking. As
every lawyer knows, human judgement is highly fallible. Any newspaper
reader knows how often the carefully considered judgement of one court
is overthrown by a higher court. Many of the techniques of modern
examination methods are attempts to reduce the effect of this judgement
factor.

But when we look at the differences between marks (both average
and largest differences), geometry is no better than history. Here in a
supposedly exact science we find nearly as wide differences as in history.
A part of the reason for this is that mathematics examiners throw away
the advantage they have in more precise marking by using a method or
scale of marking quite different from that used in any other subject.
For the judgement used in adjusting marks in other subjects the
mathematicians substitute simple arithmetic. For example, they ignore such
things as the difficulty of the question, the expected ability level of the
group, etc., and award marks quite mechanically. This results in a
different scale for mathematics marks.* (This is strikingly evident in Figure 4.2,
and even more so in Figure [...].) Comparing marks in biology and
mathematics is like comparing degrees Fahrenheit with degrees Centigrade
and assuming that, both being “degrees”, they should represent equal
amounts of heat. (This subject is dealt with in greater detail later, in the
chapter on “Scaling”.)

††The highest mark observed in a sample tends to be an under-estimate of the
corresponding highest mark in the population, a comment which does not apply to the
Mean or the S.D. of marks observed in the sample. Gayen (1962), p. 31.

*We seem to be arguing for less judgement in history and more judgement in
mathematics. Let not the reader be confused by the apparent contradiction. We are involved
here with two different kinds of judgement. One is the judgement on the basis of
which candidates are placed in rank order (which is the bare essential of marking).
This judgement factor should be reduced and made as objective as possible. The
second type of judgement, however, is the decision as to how many marks to award

It is interesting to compare our one “highly experienced examiner”
(Examiner X) with the average of the 6 examiners who re-marked his
answer books. In history, Hindi, and biology the mean marks of X
and the mean of the six were almost identical. In geometry, the six
had average marks 2.01 marks below X’s mean. But in each subject
there was more than half a standard deviation difference between X and
the six, with the difference in history being one SD. The differences, then,
are in the number of high and low Divisions awarded, not in average
marks.

It is also interesting to ask whether there is any correlation between
mean and standard deviation. In other words, do examiners who “mark
higher” also tend to “distribute more widely”, and vice versa? There
may be some correspondence, but if so it is not high. About 3 or 4 out
of every 6 seem to agree fairly closely in rank on both measures, but 2 or
3 disagree so widely that they bring the correlation way down. The rank
order correlations range from +.54 to -.20, all of which are well within
the range that might have occurred just by chance.
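
A rank-order correlation of this kind can be sketched as follows. The means and standard deviations are those of the six history re-examiners as they can be read from Table 10.3 (the table is imperfectly legible, so treat the figures as approximate). The formula is Spearman's rho for untied ranks, and the function names are ours.

```python
def ranks(values):
    """Rank values from 1 (lowest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank-order correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# History re-examiners E, I, F, J, G, H (marks out of 50), as read from the table.
means = [14.62, 15.98, 19.46, 19.82, 20.26, 22.06]
sds = [4.899, 3.962, 5.100, 5.133, 4.330, 5.464]
print(round(spearman_rho(means, sds), 2))
```

Under this reading of the table the coefficient comes to about .54, which agrees with the upper figure quoted in the text. With only six examiners a value of that size could easily arise by chance.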

Note that in Table 10.3 the Hindi marks are out of 34, while all others
are out of 50. But in the last Hindi row, and in Figure 4.2, in order to
make the 4 “pictures” comparable, Hindi figures have been multiplied by
50/34 to make them equivalent to marks out of 50. Thus we can see that
the results for Hindi are really fairly close to both history and biology.
The Hindi averages are not quite so variable as history, but slightly more
so than biology. The Hindi ranges, however, are less variable than either
biology, history, or even geometry. Apparently the examiners of the Hindi
Prose paper have a fairly clear idea of what their standard is. As we saw
from Tables 4.1 and 10.2, however, they often differ as to where within that
standard a particular individual belongs.
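
The conversion of the Hindi figures to a 50-mark basis is a simple proportion. The sketch below applies the 50/34 multiplier to the Hindi summary figures as we read them from Table 10.3; the function name is ours.

```python
def to_marks_out_of_50(value: float, maximum: float = 34.0) -> float:
    """Rescale a figure expressed in marks out of `maximum` to marks out of 50."""
    return value * 50.0 / maximum


# Hindi summary figures as read from Table 10.3 (marks out of 34).
hindi_figures = {
    "range of means": 2.56,
    "range of SDs": 1.070,
    "highest mean of differences": 2.32,
    "largest difference": 8,
}
for name, value in hindi_figures.items():
    print(name, round(to_marks_out_of_50(value), 3))
```

The rescaled values agree with the bracketed row of the table (3.76, 1.574, 3.41 and, to the nearest mark, 12).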

We have dealt here with averages, but the reader may also be
interested in the detailed scatter charts given for each pair of examiners,
in the following four chapters. There, the full extent of inter-examiner
variability will be clearly evident.

In summary, therefore, we see that even when six experienced
examiners are given virtually identical batches of answer books, in history, [...]

a given rank. It is here that, quite wrongly, the mathematics teachers of India have
abdicated their right of judgement, while those of most other subjects have retained
it. The way in which judgement can be applied to the scaling of objectively-determined
rankings is described by Gulliksen (1950), pages 265-266. (This has been adapted to
Indian conditions by Harper (1963).) Of course, in certain circumstances there are
purely objective, non-judgemental methods for scaling also.

[Table 10.3]

Marks        Examiner        Mean       σ          Mean of        Largest
                                                   differences    difference

HISTORY      X (standard)    18.10      3.814*
(50 marks)   E               14.62*     4.899      3.92           10
             I               15.98      3.962      2.94            8
             F               19.46      5.100      2.84*          10
             J               19.82      5.133      3.[..]         11
             G               20.26      4.330      2.92            7
             H               22.06**    5.464**    4.52**         16**
             Range, highest to lowest
                              7.44      1.650

HINDI        X (standard)    11.84      3.742**
(34 marks)   F               10.68*     2.672*     2.04            7
             H               11.[.]4    3.540      1.44            4
             G               11.56      3.232      1.36*           7
             E               11.98      2.753      2.00            6
             J               12.34      3.439      1.4[.]          4
             [?]             13.24**    3.314      2.32**          8**
             Range, highest to lowest
                              2.56      1.070
             (Range out of 50 marks)
                             (3.76)    (1.574)    (3.41**)       (12**)

[The Hindi examiner letters are given as extracted (F, H, G, E, J); one letter
is missing, so the pairing of letters with rows is uncertain.]

BIOLOGY      X (standard)    20.42      4.540
(50 marks)   [?]             19.34*     4.260      2.56           [...]
             [?]             20.04      7.411**    3.86**         [...]
             [?]             20.06      5.270      2.36           [...]
             [?]             20.68      3.834*     1.70*          [...]
             [?]             20.74      5.114      2.34           [...]
             [?]             22.36**    5.309      3.10           [...]

[The remainder of the table, including the geometry section, is missing.]

14 Does the number of pages written influence marks?

One old superstition can probably be laid to rest on the basis of these
data.

An American college magazine once published a picture of a student
dressed as a professor, demonstrating the method of marking answer
books. The “professor” was pictured standing at the top of a long flight
of stairs. He had tossed the entire pile of answer books down the stairs.
Obviously those with the largest number of pages were the heaviest,
and reached the bottom. These received the highest marks. But the answer
books with very few written pages were too light to travel far. They
stayed at the top of the stairs, and received the lowest marks. Indian
and American students share the conviction that many examiners just
count the number of pages, and award marks accordingly.

Is this true?

Table 10.4 examines this question in detail. The number of pages
written in each answer book was counted, and this number was correlated
with the marks awarded, in both the first and second marking of each
answer book. Table 10.4 reports these product-moment coefficients.

First, we need some sort of a hypothesis. Should we expect “by
chance” no relationship between the number of pages written and the
number of marks awarded? This hardly seems reasonable, at least not
for the extreme cases. A student who knows very little will be unable to
write more than a few pages; a student who knows a great deal will tend
to write a fairly large number of pages to include all he knows. But in
between, of course, are the students who know a little but “pad it out”
to fill many pages, and the students who know a fair amount, but write
with brevity. Thus, it seems reasonable to expect some relationship
between number of pages written and number of marks awarded, but
not a high relationship.

This is just what the data show.

We can say that an examiner is influenced by the number of pages
if the correlation of his marks with pages is “significantly higher than
we should expect”. To do this we need some base hypothesis as to the
correlation expected. In the absence of any better hypothesis it seems
reasonable to take the median value of all correlation coefficients for
a particular subject as the “expected” or “reasonable” value for that
subject. Then we can examine whether any particular examiner’s
correlation coefficients differ significantly from that value.
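
One plausible way to carry out such a comparison is sketched below. It is an assumption on our part, not the report's stated procedure: each coefficient is compared with the subject median by means of Fisher's z transformation, and a one-sided 5% criterion (z above 1.645) is used to flag values above expectation. The coefficients are invented.

```python
import math
from statistics import median


def z_against_expected(r: float, expected_r: float, n: int) -> float:
    """Fisher z statistic for an observed r against a fixed expected value."""
    return (math.atanh(r) - math.atanh(expected_r)) * math.sqrt(n - 3)


# Invented pages-vs-marks correlations for one subject, each from 50 books.
coefficients = [0.12, 0.18, 0.25, 0.31, 0.36, 0.52, 0.64]
expected = median(coefficients)
for r in coefficients:
    z = z_against_expected(r, expected, n=50)
    print(r, round(z, 2), "above expectation" if z > 1.645 else "")
```

Taking the median as the baseline means that about half the coefficients fall below it by construction, so only the upper tail is of interest.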

In Table 10.4 we find (*) to indicate “significant difference” ([...]
level of confidence), and (**) to indicate “highly significant” in the positive
(i.e. above expectation) direction. In other words, perhaps 10% of the
time we have a suspiciously high correlation between pages and marks.

But let us look further. 

(1) It seems reasonable to assume that if an examiner is influenced,
he is likely to be influenced on all batches of answer books that he
examines. (2) We have no evidence against the proposition that in a
particular batch of answer books, there may be a higher “reasonable”
relationship (due to the special quality of those candidates) than exists
for the average.

Proposition (2) may be re-stated to the effect that correlation of pages
with marks may be a function of the batch of answer books (rather
than of the examiner). The bottom section of each of the four columns
seems to support this hypothesis. Look at the pairs where Examiner X
is paired with Examiner E, F, etc., through J. (Remember that 300 answer
books of Examiner X were divided into six batches of 50 each, with
approximately the same means and standard deviations. Also note,
of course, that “Examiner X” is a different examiner for each of the four
subjects. This is true of the other examiners also.) Thus six samples of
each Examiner X’s answer books were re-marked by six different
examiners. But a look at the pairs of marks will show that there is definitely
a correlation between them. In the History column, for example, .52
is paired with .36, then .18 is paired with .12. Thus we may conclude that
correlation between number of pages and marks is frequently the function
of the answer books themselves, rather than of the examiners. (There is
another possible explanation. It just might be that both examiners tended
to be highly influenced by one set of books, and not by another. This
seems unlikely, when 7 examiners are involved in each subject, a total of
28 examiners for this part of the analysis.)

So now we go back again to our “highly significant” figures. It seems
evident that some of them may be due to the answer books rather than
to the examiners. How can we tell? (1) We can see if any examiner is
consistently too high (i.e. significantly high for all his answer books).
(2) We can compare pairs of examiners on the same answer books.

By the first test, we find only two examiners all of whose correlations
are significant at the 5% level or better: Examiners C and J of history.
(Only two other examiners come anywhere near “consistency”, and both
of them, Examiners C and E of mathematics, are consistently
independent of the number of pages.)

For the second test, the significance of the difference was calculated
(using z) for all suspicious pairs of coefficients. Again, History Examiner C
differed significantly from Examiner D on both batches of tests that were
marked by both. However, history Examiner J did not differ significantly
from Examiner X on the one batch of tests that both examined.
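
The report says only that the difference was tested “using z”. A minimal sketch of one common form of that test follows, assuming Fisher's z transformation and two independent samples. When both coefficients come from the same answer books the samples are not independent, so the result is only approximate. The coefficients are illustrative.

```python
import math


def z_for_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """Fisher z test statistic for the difference between two correlations."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / standard_error


# Illustrative coefficients, each computed on a batch of 50 answer books.
z = z_for_difference(0.52, 50, 0.18, 50)
print(round(z, 2), "significant at 5%" if abs(z) > 1.96 else "not significant at 5%")
```

With batches of only 50 books even a difference as large as .52 against .18 falls just short of the two-sided 5% criterion, which illustrates why few pairs in the table are flagged.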

It is interesting to note that the median r was lower for history than
for any other subject. Yet it is only in history that we have found any
evidence of undue influence of number of pages on marks.

So far we have been talking about statistical significance. But in this
case, psychological and educational significance seem to be accurately
reflected by statistical significance. (This is partly because N’s are small,
so r’s and differences need to be large to be statistically significant.)
Inspection of the figures suggests that few if any of the correlations that
are not statistically significant could be expected to have any educational
or psychological significance either, even if N’s were large enough to make
them statistically significant.

We thus conclude that only two out of 44 examiners were significantly
influenced merely by the number of pages written by the candidate.
This is less than 5%. Many statisticians would, in fact, go a step further.
They would say that if you find evidence of relationship in less than 5%,
then this itself may be due merely to “chance” and not to any actual
relationship. This certainly would have been true if the significance had
been only at the 5% level.
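
The arithmetic behind this caution is short. The sketch below treats each examiner as a single test at the 5% level, which is a simplification of ours; the counts are those given in the text.

```python
examiners = 44
flagged = 2
level = 0.05                       # nominal significance level

observed_share = flagged / examiners
expected_by_chance = examiners * level
print(round(100 * observed_share, 1), round(expected_by_chance, 1))
```

About two examiners in 44 would be expected to cross a 5% criterion by chance alone, which is the number actually found.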

The superstition that one gets a higher mark if one writes a lot,
regardless of contents, may therefore now be laid at rest. And it would be
a boon to examiners if students stopped believing this old superstition,
and did not fill up their answer books with so much rubbish!

Detailed Statistics: History


THE HISTORY paper consisted of 10 questions, out of which each
candidate was to answer any five. In one of the questions, the
candidate had a choice of 5 out of 8 parts. Each question carried
10 marks, for a maximum of 50 marks for the paper.

1 Distribution of marks

Table 11.1 gives the frequency distribution for each batch of 50
answer books in History. These are the detailed frequencies on which
the totals given in Table 4.5 are based.

Table 11.1-A shows the frequency distributions for the first marking
of each batch of answer books. The top row names the original examiner
of each batch. It is his marks which are distributed in this part of the Table.
The hundred answer books of Examiner A were split into two
equivalent batches of 50 answer books each. One of these batches was to be
re-marked by Examiner A himself, the other by Examiner B, as shown
in the second row of the Table. The answer books of Examiners B, C
and D were treated in the same way. The reader can see, from this Table,
that the batches were as closely matched as was reasonably possible.

The 300 answer books of Examiner X were split into 6 equivalent
batches. The matching here also is as close as is feasible.

Table 11.1-B shows the marks awarded each batch of 50 answer
books, the second time they were marked. Compare, for example, the
third pair of columns. The two batches of Examiner C’s work, which were
closely matched in the first part of the Table, are now awarded quite
different distributions of marks by Examiners C and D.

The lines across the Tables show the divisions between Classes. [...]
candidates were awarded 16 [...] 16’s were awarded, though [...] disappears.
There is also a slight J-effect between [First] and Second Divisions in
the first marking, but, again, not in the [second]. [There is] an interesting
reverse J between II and III in both markings. Apparently history
examiners are a little reluctant [...] they tend to pile up candidates with
2[.] rather than award a 23 [...] actually [...] the marks from 37 to 50 are
wasted; they might just as well not exist.

Table 11.2 summarises most of [...] of these [...] have already been
[...] of 50 answer books which provided the [statistics]. [...] row tells who
examined the answer [books] the second time. The third row merely notes
(for easier reference) [whether] the answer books were re-examined by the
Same examiner, [or by one Different] from the original. In the remainder of
the Table, [“O” refers] to the Original marking, and “R” to the Re-examining.

3 Mean and standard deviation

[...] not be taken too [...] first hundred answer [...] is no reason to
believe that [...] arbitrary. When the second mean is higher, we have
attached a “+”; when the mean for re-examining is lower, we have attached
a “-”.

This seems logical (or at least psychological!).

The same comments apply to the SD’s (standard deviations). A “+”
means the second examiner widened out the distribution; a “-” means
that he attenuated, narrowed or compressed it.

Note, however, that in a sense the “direction” is really irrelevant.
“First” and “second” examiners could easily be interchanged. There are
some consistent features of the “second” examiners: on the average
they have marked a little lower, and with less J-effect. However, the far
more important factor is the actual differences, regardless of “direction”.

Differences between means range up to 5.02 (which is 10% of marks).
Differences in SD range up to 3.061 (or 6% of marks).

For the difference between means, * indicates that the difference
is significant at the 5% level of confidence; ** that it is significant at
the 1% level of confidence; and *** that the difference is significant at
the 0.1% level of confidence.

The interpretation for the asterisks attached to the SD’s is the same,
except that, unfortunately, we had no tables for significance at the 0.1%
level of confidence here.

We may summarize these data as follows, giving the number of
differences which are significant at each level of confidence:

          .001    .01    .05    NS
    M      13      4      1      2
    SD     XX      6      3     11

                                                   Re-examiner
                                                  Same    Diff.
    Neither M nor SD differences are significant    0       0
    Mean difference only is significant             7       4
    SD difference only is significant               1       1
    Both M and SD differences are significant       2       5

Remember that if the examiners are reliable (in the broad sense of
the term), then the differences should be not significant. No batch of
papers was marked by a pair of examiners whose differences were small
enough to be attributed only to random chance. All errors were significant
errors, indicating that factors other than sheer chance were operating.

4. Distribution of Divisions

Those who do not understand means and standard deviations will
get a clear picture of what is happening when they look at the next few [...]

[...]

The self-correlations of examiners re-examining their own answer
books ranged from .72 (Examiner I) to .93 (Examiner F). The median
was .85, indicating a fair degree of agreement. This compares favourably
with such studies done in other countries.

But when examiners marked each other’s answer books, the
product-moment coefficient of correlation dropped as low as .49 (D-C). The
highest correlation between two history examiners was .85, which is the
same as the median self-correlation.

It will be remembered that the previous study of 90 examiners marking
the same 10 history answer books estimated the average rank order
inter-correlation as .83. Why does the present study show somewhat lower
inter-correlations (median = .74)? We do not know. One possible reason
may be that, while in the first study we selected only answer books in
which all candidates had answered the same five questions, in this study
there was no such restriction. It is well known that allowing a choice of
questions reduces the reliability and validity of an examination. It is
possible that it may reduce reliability of marking, also.

But even the highest self-correlation, .93 for Examiner F, does not
look so good when you examine the differences in M and SD (1.08 and
.413), or the differences in the Divisions and Pass Percentage (12%),
for this examiner. This clearly illustrates the fact that the correlation
coefficient alone is misleading, as it takes no account of differences in
M and SD.

7 Content reliability 

The calculations for content reliability (see Chapter VII) were done
on the basis of the original marks awarded each batch of answer books.*
(Note that the batches for A, B, C, D are 100 each; for E, F, G, H, I, J
are 50 each; and for Examiner X the number of answer books is 300.)

*An interesting exception to this must be noted in the case of Examiner D. The
calculated content reliability for this examiner’s 100 answer books is -1.182. Obviously
this is absurd, being both negative and more than unity. Furthermore, the examiner
reliability of .494 indicates that there must be at least a minimum of content reliability.
On investigation, this odd result was found to be due to the abnormally small SD
(about 2 marks out of 50) for this examiner. We therefore based our estimate of content
reliability on two other calculations:

    D re-marking his own answer books    .259
    C re-marking D’s answer books        .775

(Note that C’s result here is slightly higher than the reliability of his own first marking,
.755, though the difference may not be significant.) These two reliabilities were
translated into Fisher’s z, averaged, and the mean z translated back into r = .570.
The skeptical may think this negative reliability invalidates the formula used. Not so.
An objective examination of 36 questions (equivalent to the range of marks actually
used by history examiners), a mean of 18.74 and a SD of 1.96, would have a
Kuder-Richardson Formula 21 reliability of -1.38. (If we take number of questions as 50,
the reliability becomes -2.09.) These cases of negative reliability above unity would
not lead us to throw out KR 21; we would only throw out the examination!
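
Both calculations mentioned in the footnote on Examiner D can be retraced in a few lines. The sketch assumes the textbook forms of Fisher's z transformation and of Kuder-Richardson Formula 21; the function names are ours, and the inputs are the figures quoted in the footnote.

```python
import math


def average_r_via_fisher_z(*coefficients: float) -> float:
    """Average correlations by transforming to Fisher's z and back."""
    mean_z = sum(math.atanh(r) for r in coefficients) / len(coefficients)
    return math.tanh(mean_z)


def kr21(n_items: int, mean: float, sd: float) -> float:
    """Kuder-Richardson Formula 21 from test length, mean and SD."""
    spread_term = mean * (n_items - mean) / (n_items * sd ** 2)
    return (n_items / (n_items - 1)) * (1.0 - spread_term)


print(round(average_r_via_fisher_z(0.259, 0.775), 3))
print(round(kr21(36, 18.74, 1.96), 2))
print(round(kr21(50, 18.74, 1.96), 2))
```

The averaged coefficient comes out within a unit in the third decimal place of the .570 reported (the original was presumably worked from printed tables), and the two KR-21 values agree with the -1.38 and -2.09 quoted.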

The fact that there is a wide variety of content reliabilities for batches
of answer books from the same examination needs some explanation.
In the first place, some of these differences may be due to different choices
of questions. However, probably more important is each examiner’s
style or method of marking. Thus two examiners may expect two different
kinds of answers (not completely different, of course, but different in
significant aspects) from the same question. In a sense, therefore, we
do not really have a single examination; we have as many examination
papers as we have examiners. So students sitting for the same examination
are really sitting for different examinations, and because the student
knows nothing about his examiner, he does not know which examination
he is sitting for. This is in line with Taylor’s remark (which repeats in
India similar studies done in the West), that the candidate’s mark depends
more on the luck of who his examiner is than on his actual merit.

Content reliabilities range from .21 (for H) to .76 (for C). But even
this highest value would be considered grossly inadequate in an objective
examination.

8 Marks reliability 

As has been explained earlier (Chapter VII), the only fair comparison
between traditional and objective examinations is on the basis of what
we have termed “marks reliability”. This is the product of examiner
and content reliability. We have calculated this only in the cases where a
different examiner has re-marked the answer books. “Marks reliability”
attempts to estimate the correlation between the marks of a batch of
candidates who have answered examination I marked by examiner P,
and examination II marked by examiner Q.

Marks reliabilities range from a low of .28 to a high of .43. Even this
highest reliability is not high enough to estimate the class or group average
adequately, let alone estimate the true merit of any individual.
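
Since marks reliability is defined here as a simple product, it can never exceed the smaller of its two factors. A brief sketch with illustrative inputs (they are not a matched pair taken from Table 11.2):

```python
def marks_reliability(examiner_reliability: float, content_reliability: float) -> float:
    """Marks reliability as defined in the text: examiner x content reliability."""
    return examiner_reliability * content_reliability


# Illustrative values only; not a matched pair from the report's tables.
print(round(marks_reliability(0.74, 0.57), 2))
```

Even moderately good examiner agreement combined with a middling content reliability yields a figure in the low .40s, which is the order of magnitude the section reports.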

Of course, the reliability of the history mark based on two papers
is somewhat higher than the reliability of a single paper. However, the
difference is not enough to justify the heavy reliance placed on history
marks.
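
The report does not say how the reliability of a mark based on more than
one paper was estimated. One standard way to gauge it is the
Spearman-Brown formula. The sketch below applies it under two assumptions
that are ours, not the report's: that the papers are parallel, and that
each has the same single-paper reliability. The function name is
hypothetical.

```python
def spearman_brown(r_single: float, n_papers: int) -> float:
    """Reliability of a total over n parallel papers (Spearman-Brown)."""
    return n_papers * r_single / (1 + (n_papers - 1) * r_single)


# Lowest and highest single-paper marks reliabilities quoted for history.
for r in (0.28, 0.43):
    print(f"single paper {r:.2f} -> two papers {spearman_brown(r, 2):.2f}")
```

On these assumptions even the best single-paper value rises only to about
.60 for two papers, which agrees with the remark that the gain is modest.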

In a sentence: History marks are unreliable.

9. Standard error of measurement (SEM)

The SEM ranges from 4.48 to 3.01. This means that even for the
most accurate batch, there are only two chances in three that a given
candidate's mark will be within 3 out of 50 of his “true” mark. For
about 5 candidates in every 100, the error will be 12% of marks or more,
for this best batch. But as has been pointed out, our concern should be
for the largest (not the smallest) errors that can occur. A SEM of 4.48
means that for one candidate in each hundred (or 1,000 in every lakh),
the error will be 23% of marks or more. And whether the mark awarded
will be 23% above or 23% below the “true” mark is a matter of sheer
chance.
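
The percentages above follow from the normal curve. A minimal sketch of
the arithmetic is given here, assuming errors of measurement are normally
distributed with a standard deviation equal to the SEM, and taking the
smallest SEM as about 3 marks. The function name is hypothetical.

```python
from statistics import NormalDist


def error_exceeded_by(sem: float, fraction: float, max_marks: float) -> float:
    """Error (as % of maximum marks) exceeded by `fraction` of candidates.

    Two-sided: the error may be above or below the "true" mark.
    """
    z = NormalDist().inv_cdf(1 - fraction / 2)
    return 100 * z * sem / max_marks


best_batch = error_exceeded_by(sem=3.0, fraction=0.05, max_marks=50)
worst_batch = error_exceeded_by(sem=4.48, fraction=0.01, max_marks=50)
print(f"best batch, 5 in 100:  {best_batch:.1f}% of marks")
print(f"worst batch, 1 in 100: {worst_batch:.1f}% of marks")
```

Rounded to whole percentages, these give the 12% and 23% quoted in the
text.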

But note that these estimates of errors take into account only the 
correlation — not the differences in M and SD. Thus they would be true 
only if the marks of all examiners were statistically scaled to a common 
standard. More exactly, each SEM shows the range of error which would 
exist if the re-marking by the second examiner were scaled to the mean
and standard deviation of the original marking of the original examiner.
The actual range of error, under current unscaled examination marking
methods, would probably be much greater (see discussion in Chapter
VII).
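
Scaling one examiner's marks to another's mean and standard deviation is a
simple linear transformation. The sketch below shows one way it could be
done; the function name and both sets of marks are hypothetical, and the
report does not state that scaling was carried out in exactly this form.

```python
from statistics import mean, pstdev


def scale_to(marks, target_mean, target_sd):
    """Linearly rescale marks so that they have the target mean and SD."""
    m, sd = mean(marks), pstdev(marks)
    if sd == 0:
        raise ValueError("marks have no spread and cannot be rescaled")
    return [target_mean + target_sd * (x - m) / sd for x in marks]


# Hypothetical marks for the same five answer books.
original = [12, 15, 19, 22, 27]   # first examiner
remarked = [10, 11, 14, 15, 20]   # second examiner, lower and narrower

rescaled = scale_to(remarked, mean(original), pstdev(original))
print([round(x, 1) for x in rescaled])
```

After scaling, the two sets of marks share the same M and SD, so any
remaining disagreement is due to the correlation alone. This is the
situation the SEM describes.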

10. Differences

The last section of Table 11.2 presents the frequency with which 
each difference between examiners of a single answer book appeared. 
Note that this does not take into account the content reliability (as the 
SEM does), but only the actual difference between two markers of the 
same answer book. If each candidate had answered two parallel examina- 
tions, marked by different examiners, these differences would be much 
larger. 

The largest difference for the same examiner re-examining 50 of 
his own answer books was 10 marks. There were three such differences. 
The details of one of them (Examiner C) are as follows: 


Question    First Marking    Second Marking
   1             4                6
   2         [illegible]          4
   3         [illegible]          8
   4             6                8
   9             2                2½
 Total          19               28½ = 29

The largest difference for a pair of examiners was between Examiners
X and H, and was 16 marks. The details are as follows:


Question    First Marking    Second Marking
[The marks for this answer book are illegible in the source.]

[...] 200 out of every lakh of candidates [...]

Note that, by chance, the [...] examples [are] of the second examiner
raising the marks. But [...] often, the second examiner lowered the
marks [...] random.

It is also instructive to look at a case of “agreement” between two
examiners (Examiners C and D):

Question    First Marking    Second Marking
[The individual marks are illegible in the source; the same total
appears under both markings.]

We will let the reader draw his own [conclusions]. [...] faith in such
averages [...] not the averages. Even [...] were different in the two
markings. There is little comfort in the “average”, when the examination
can be so very unjust to so many.


11. Scatter charts

Table 11.3 compares the work of Examiners A and B on their own
and on each other's answer books. It also compares the re-marking of
Examiners C and D with each other.

These and Table 11.4 are the individual scatter charts which are
summarized in Table 4.3.

For example, look at the first chart, showing what happened when
Examiner A re-examined his own 50 answer books. Out of 29 that he
awarded a Third Division the first time (line 3 and column T, which
means Total), the second time he reduced 7 of them to Fail, raised 6
to a Second Division, and awarded one a First. Conversely, of the three
who were given a First Division in the second marking, only one had
been a First originally, and the other two were a Second and a Third.
Yet this is an examiner with the relatively high self-correlation of .89.
The little dashes (-) indicate zero cases at the points which would
show perfect agreement on Division. They are to help the eye to find the
diagonal of perfect agreement, in each chart.

At the bottom of the chart we also find the words, “Agree=64%”.
This refers to the fact that 64% of the candidates received the same Class
or Division in both markings by Examiner A.
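
The “Agree” figure is simply the share of candidates who fall on the
diagonal of the scatter chart. A minimal sketch of the calculation
follows; the function name and the five pairs of Divisions are
hypothetical and are not taken from Table 11.3.

```python
def percent_agreement(first, second):
    """Percentage of candidates given the same Division in both markings."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    same = sum(a == b for a, b in zip(first, second))
    return 100 * same / len(first)


# Hypothetical Divisions for five candidates, first and second marking.
first_marking = ["III", "III", "II", "Fail", "I"]
second_marking = ["III", "Fail", "II", "Fail", "II"]
print(percent_agreement(first_marking, second_marking))  # 3 of 5 agree: 60.0
```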

The remaining charts may be read in the same way. Note that the
top row of four charts in Table 11.3 shows the self-agreements of the four
examiners; the bottom row shows their agreements with each other.

In the bottom row, it is interesting to note that Examiners C and
D seem to agree with each other much less than Examiners A and B
agree. This may be partly due to different styles or concepts of marking.
Especially note the very narrow distribution of Examiner D's marks,
both times with his own answer books, and also when re-examining
Examiner C's answer books. Where C distributes his groups from I to
Fail, D uses only two categories.

This is a rather striking illustration of the fact that the distribution
of marks (and, hence, number in each Division) is often more a
characteristic of the examiner than of the batch of answer books examined.
For the same batches, C distributes widely, D narrowly. Thus the student's
Division depends more on luck than on merit. It might also be added
that the lower self-correlations of C and D suggest that they may be
somewhat more careless (less self-consistent) than Examiners A or B.*

* [...] think that D's [...] self-correlation is due to a restriction
of range, since he uses only two categories. But the “restriction of range
effect” [...]




A difference of .40 is significant at the 5% level of confidence.
A difference of .53 is significant at the 1% level of confidence.

Do examiners agree with themselves any better than they agree with
each other? All r's were translated into Fisher's z, and the differences
[...]. The last two lines of Table 11.3 give the result. [...]

[...] (.93), spread both his [...] candidates over three [Divisions].

Out of fourteen comparisons between correlation with self and
with another examiner, the differences for one-half (i.e. 7) were
significant, two of them being highly significant. In only one instance was
the correlation with Other higher than with Self, and that was not
significant. This is one more bit of evidence that each history examiner has
his own special style of examining, and that there is not a high degree of
agreement among examiners as to what is and is not important.
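
The comparison of two correlations through Fisher's z can be sketched as
follows. It assumes, as the design suggests, that each correlation is
based on a separate batch of 50 answer books. The function names and the
two sample correlations are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z(r: float) -> float:
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def critical_z_difference(n1: int, n2: int, alpha: float) -> float:
    """Smallest difference between two Fisher z values that is significant
    (two-sided) for independent samples of size n1 and n2."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return NormalDist().inv_cdf(1 - alpha / 2) * standard_error


threshold_5 = critical_z_difference(50, 50, alpha=0.05)
threshold_1 = critical_z_difference(50, 50, alpha=0.01)
print(f"5% level: {threshold_5:.2f}   1% level: {threshold_1:.2f}")

# Hypothetical pair: correlation with Self versus correlation with Other.
difference = fisher_z(0.85) - fisher_z(0.60)
print(f"z difference = {difference:.2f}, "
      f"significant at 5%: {difference > threshold_5}")
```

With two batches of 50 the thresholds come out at about .40 and .53.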

The interested reader may make other detailed comparisons between 
scatter charts for himself. 


12. Self-consistency of examiners' distributions

We have already discussed the product-moment correlation as a 
measure of self-consistency. Another measure is the extent to which 
examiners tend to agree with themselves in M, SD, and percentages of 
marks in each Division. 

We get a more critical picture of this when we examine each pair
of examiners. Take, for example, Examiners A and B. Remember the
experimental design: The hundred answer books originally marked by
Examiner A were split into two equivalent batches of 50 each. Similarly, 
B’s 100 answer books were split into two equivalent batches. Fifty 
of Examiner A’s and 50 of Examiner B’s answer books were then mixed 
up in a random order, and sent to Examiner A for re-marking. Similarly, 
B received a mixed batch of both his own and A’s answer books. The 
examiners were told only that “some of the books” would be answer 
books originally marked by themselves, some originally marked by others. 
In response to a questionnaire, examiners assured us that they were 
generally not aware of which books were which. 


Here is the comparison of A and B (from Table 11.2):

            Mean                SD
         A       B          A       B
       19.1    15.5       5.30    5.24
       16.9    15.8       6.36    5.38

In the first row, the 50 answer books to which A originally awarded
mean marks of 19.1 were awarded mean marks of 15.5 by Examiner B.
The second row must be read backwards. The batch to which B originally
awarded a mean of 15.8 was awarded mean marks of 16.9 by A. Thus
for both batches, B's mean is consistently lower than A's. Similarly,
B's SD is consistently smaller than A's.

Here is the same comparison for C and D, who were also a matched
pair in the experimental design:

            Mean                SD
         C       D          C            D
       19.9    14.9     [illegible]    3.35
       18.9    18.7       5.02         1.96


[The examiner labels for the rows of this table are illegible in the
source.]

         Means when          SDs when
         re-marking          re-marking
        Self      X         Self      X
        12.7    14.6        4.49    4.90
        16.6    19.5        4.97    5.10
        16.8    20.3        4.64    4.33
        17.6    22.1        5.52    5.46
        15.2    16.0        3.42    3.96
        19.3    19.8        5.80    5.13


[...] matched groups of answer books. Note, [...] and I re-examining
X's answer [books ...] differ [...] consistently.

All of these facts suggest that [...] from each other in [...]. Each
examiner applies his own concept of what the average marks should be,
how many should [...] batch of answer books [...] practice [...]
Countries.)

13. Item analysis

[...] that any examiner can carry [out].
Merely select a random group of answer books (say 100), arrange in order
of merit, divide them into several equal groups, and then tabulate how
well each of these groups did on each question.
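
The procedure just described can be sketched in a few lines. This is only
an illustration of the steps, not the program used for this study. The
function name, the data layout (one dictionary per candidate, mapping the
questions attempted to the marks awarded) and the four sample candidates
are all hypothetical.

```python
from statistics import mean


def item_analysis(candidates, n_groups):
    """Tabulate, for each ability group, how many chose each question
    and their mean marks on it. Groups are returned best first.

    Candidates left over after forming equal groups are ignored.
    """
    # Arrange in order of merit, judged by total marks.
    ranked = sorted(candidates, key=lambda c: sum(c.values()), reverse=True)
    size = len(ranked) // n_groups
    table = []
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size]
        questions = sorted({q for c in group for q in c})
        table.append({
            q: (sum(q in c for c in group),
                mean(c[q] for c in group if q in c))
            for q in questions
        })
    return table


# Four hypothetical candidates, each answering two questions.
sample = [{1: 6, 8: 2}, {1: 5, 7: 3}, {1: 3, 8: 4}, {7: 1, 8: 5}]
for level, row in enumerate(item_analysis(sample, n_groups=2), start=1):
    print(f"group {level}: {row}")
```

Each entry gives the number choosing the question and their mean marks on
it, which is the layout of Table 11.5.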

We divided our thousand History answer books into five groups
of 200 answer books each, on the basis of the first marking. Table 11.5
presents the results.

We had agreed to keep secret the identity of the Board supplying
these data. Unfortunately, this precludes our publishing the history
question paper. However, its organization can be revealed, as it is fairly
typical.

There were 10 questions, each question carrying 10 marks. The
candidate was to answer any 5 out of the 10. Maximum marks were
thus 50. Question 9 had 8 parts, of which the candidate (if he chose
Question 9) was to answer any 5 parts, for 2 marks each.

Table 11.5 gives the number of candidates of each ability level
choosing each question, and their mean marks on that question. For
example, the average marks of the middle group were 17.37 (see
right-hand column). One hundred and ninety-seven of these 200 candidates
chose to answer Question 1. The average marks of these candidates on
Question 1 were 4.34. On the other hand, candidates of equal ability
(as judged by total marks) who chose Question 7 were awarded an average
of only 1.96 marks. Thus the average-ability candidate lucky enough
to choose Question 1 rather than Question 7 is awarded (on the average)
2.4 extra marks, and even the candidate choosing Question 9 gets 2.3
extra marks for his choice.

Gayen (1961, pp. 51-53) has emphasized the fact that a candidate's
total marks are determined partly by his choice of questions. This can
be illustrated for the top-scoring group (top row of Table 11.5), by
dividing the 10 questions into two sets of “choices” of five each:


Question   Marks        Question   Marks
    9       5.8             2       4.0
   10       5.1             4       4.0
    1       4.9             6       4.0
    3       4.5             8       3.6
    7       4.4             5       3.2
  Total    24.7           Total   [18.8]

Thus candidates of equal ability can get different marks just by
happening to choose one or another set of questions. We have argued
elsewhere that the examiner who allows a choice of question is stating
that he believes the questions to be of equal difficulty. Since this is very
obviously not true, and since the candidate has no way of knowing which
questions the examiners will mark “stiffly” and which “easily”, the
traditional essay-type examination ends up as a sort of guessing game. The
student who is lucky enough to make the right guesses gets higher marks
than he deserves; the poor fellow who guesses wrong suffers unjustly.
Note, in Table 11.5, that the questions differ quite widely in
“popularity”. Hardly any candidates selected Question 8; practically all of
them selected Questions 1 and 3.
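
The effect of the two sets of choices can be checked by adding up the mean
marks of the top-scoring group on each set. The sketch below reads the
flattened figures in the source as marks with one decimal place, which is
our assumption; the variable names are hypothetical.

```python
# Mean marks of the top-scoring group, keyed by question number.
high_set = {9: 5.8, 10: 5.1, 1: 4.9, 3: 4.5, 7: 4.4}
low_set = {2: 4.0, 4: 4.0, 6: 4.0, 8: 3.6, 5: 3.2}

high_total = round(sum(high_set.values()), 1)
low_total = round(sum(low_set.values()), 1)
gap = round(high_total - low_total, 1)

print(f"high-marking set: {high_total} out of 50")
print(f"low-marking set:  {low_total} out of 50")
print(f"difference:       {gap} marks")
```

On these figures, two candidates of the same ability could differ by
nearly six marks out of 50 through choice of questions alone.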

Most of the questions show a fair degree of relationship with total
marks. But Question 8 is an excellent example of negative validity: the
poorer the student (as judged by total marks), the better he does on
Question 8.

Question 8

                  Mean      Average Marks    Marks on
                  Marks     per Question     Question 8
Top group         23.464       (4.7)            3.6
Second level      19.665       (3.9)            2.0
Third level       17.372       (3.5)            2.2
Fourth level      16.118       (3.2)            2.4
Poorest group     11.062       (2.2)            4.3


The students who averaged only 11 marks at the examination did
much better on this question than those who averaged 19.7 marks (4.3
marks vs 2.0). Thus the effect of Question 8 was to raise the marks of
poor students, and lower the marks of the best, for those who chose it.
The same effect is partly evident in Question 7, which shows a non-linear
relationship. This is just one more of the ways in which allowing a choice
of questions lowers the validity of an examination.

A more sophisticated method of item analysis might have revealed
these effects in more precise detail. This was not done, as this was not
a central concern of this study. However, as a sidelight, the results
seem sufficiently revealing to justify the effort put into them.

THE HINDI paper consisted of 6 parts, all compulsory, with a wide
selection of questions within each part. This paper covered prose
text, grammar, unseen and supplementary books. It was a three-hour
paper, and maximum marks were 34. In some of the following we
have reported in terms of 34 marks; in other places, in order to compare
with the other three subjects, we have multiplied marks by 50/34 to make
them equivalent to the “out of 50 marks” scale.
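
The conversion to the 50-mark scale is a single multiplication. A minimal
sketch follows, with a hypothetical function name, applied to the two
Hindi SEM values and the largest self-difference reported later in this
chapter.

```python
def to_scale_of_50(marks_out_of_34: float) -> float:
    """Convert Hindi marks (maximum 34) to the 'out of 50' scale."""
    return marks_out_of_34 * 50 / 34


for marks in (1.89, 2.64, 8):
    print(f"{marks} out of 34 = {to_scale_of_50(marks):.2f} out of 50")
```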

1. Distribution of marks

Table 12.1 gives the frequency distribution for each batch of 50
answer books, in Hindi. These are the detailed frequencies on which
the totals given in Table 4.5 are based.

Table 12.1-A shows the frequency distributions for the first marking
of each batch of answer books. The top row names the original examiner
of each batch. It is his marks which are distributed in this part of the
Table.

The hundred answer books of Examiner A were split into two
equivalent batches of 50 answer books each. One of these batches was
re-marked by Examiner A himself, the other by Examiner B, as shown
in the second row of the Table. The answer books of Examiners B, C, and
D were treated in the same way. The reader can see from this Table that
the batches were as closely matched as was reasonably possible. Some
adjustments had to be made to make the means and SDs equal.

The 300 answer books of Examiner X were split into 6 equivalent
batches. The matching here, also, was as close as is feasible.

Table 12.1-B shows the marks awarded each batch of 50 answer
books, the second time they were marked. Compare, for example, the
third pair of columns. The two batches of Examiner C's work, which
were closely matched in the first part of the Table, are now awarded
quite different distributions of marks by Examiners C and D. A similar
effect is evident in B's work when re-evaluated by A and B.

The lines across the Tables show the divisions between the Classes
or Divisions.* Notice the striking J-effect in the first marking (12.1-A).
A large number of candidates (194) were awarded 11 (passing), but only
a quarter as many (52) were awarded a failing 10. In re-marking (12.1-B)
this J-effect is greatly reduced, but by no means disappears. Notice
the very striking effect in Examiner D's answer books. While the first
time there were 16 11's and no 10's in each batch, on the second marking
more 10's than 11's were awarded! Examiner H, on the other hand,
* The distribution for Hindi is interesting. The maximum marks in this paper are 34.
Thirty-three per cent marks is, of course, passing. This is 12 marks. Yet the J-effect
shows that the examiners are obviously thinking of eleven (rather than twelve) marks
as passing. Thus they are apparently thinking of 32% marks as being the passing level
on this particular paper.

How can we explain this paradox? It seems to be in the distribution of marks in the
three papers, and the fact that it is the total marks on the three (not marks on any
single paper) which determines passing. The maximum marks on the three Hindi
papers are 34, 34, and 32. Note that 33% of these is 11.22, 11.22, and 10.56 respectively.
Thus if 11 is considered passing on each paper, the total of 11+11+11=33. Hence
the examiner who thinks, “this candidate should just barely pass” will give him 11
marks (32%) rather than 12.

This “brinkmanship”, which is typical of other subjects also, seems rather naive,
to say the least. It seems to depend on the assumption that the “should barely pass”
candidate will receive the same marks on all papers. Yet the probability of this is
rather low, as various studies have shown that the correlation between papers is
nowhere near perfect. So why not do away with this, and abolish the J-curve in
marking altogether? The logic is simple. The examiner should, we believe, reason as
follows: “Only a few candidates will receive the same mark on all papers. For most,
the marks will differ. So (in Hindi) instead of raising a 10 to an 11, why not give him
10? If he gets higher marks on the other papers, he will pass anyway, whether I give
him 10 or 11. If he fails on the other papers, he will fail anyway, regardless of whether
I give him 10 or 11. In the unlikely event that he gets just exactly 11 on each of the
other two papers, of course he will fail if I give him only 10. But is there really any
educational logic in passing him by giving him an extra unearned mark on Paper 1,
when his total performance is so low anyway?” (Incidentally, India seems to have
about the lowest pass marks in the world. Some years ago the senior author had the
privilege of teaching a class at the International Statistical Education Centre, in
Calcutta. They represented some dozen Asian countries. Once he asked them about the
pass marks in each of their countries. They ranged from 60% in the Philippines down
to 33% for India and Pakistan, who shared the “honour” of being at the bottom.)
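
The arithmetic behind this explanation of the 11-mark pass line can be
laid out as a short sketch. The variable names are hypothetical; the
maxima and the 33% pass level are those given in the note above.

```python
paper_maxima = [34, 34, 32]      # maximum marks on the three Hindi papers
pass_fraction = 0.33             # pass level applied to the total

# 33% of each paper taken separately.
per_paper = [round(pass_fraction * m, 2) for m in paper_maxima]

# 33% of the combined maximum of 100 marks.
total_needed = round(pass_fraction * sum(paper_maxima), 2)

# What 11 marks amounts to on a single paper out of 34.
eleven_as_percent = round(100 * 11 / 34, 1)

print("33% of each paper:", per_paper)
print("marks needed on the total:", total_needed)
print("11 + 11 + 11 =", 11 * 3)
print("11 out of 34 is", eleven_as_percent, "per cent")
```

Three marks of 11 just reach the 33 needed on the total, although 11 is
only about 32% of a single paper.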
[Table 12.1-B: marks awarded by the second examiner to each batch of 50
answer books in Hindi. The table body and its note are illegible in the
source.]

maintains a consistent J in both his original marking (Table 12.1-A)
and his re-marking of both his own and X's answer books (Table
12.1-B).

There is also a slight J-effect between the First and Second Divisions
in the first marking, but again not in the second. (The identical effect
was noted in history also.) And there is also (as in history) an interesting
reverse J between II and III in both markings. Apparently Hindi
examiners, like history examiners, are a little reluctant to give a Second
Division, so they tend to pile up candidates with 14 marks rather than award
a 15.

Note that only 26 of the 35 points (34 + 0) in the scale are actually
used. The marks from 27 to 34 are wasted; they might just as well not
exist. In a rational measurement scale, full marks should be considered
attainable.

2. Table 12.2

Table 12.2 summarizes most of the statistical tables produced by
the IBM 1620 computer. The summary means and medians of these
statistics have already been reported in Table 10.1.

The top row tells who originally examined the batch of 50 Hindi
answer books which provided the statistics in the column. The second
row tells who examined the answer books the second time. The third row
merely notes (for easier reference) whether the answer books were
re-marked by the Same examiner, or a Different one from the original.

In the remainder of the Table, O refers to the Original marking
and R to the Re-examining.

3. Mean and standard deviation

The wide variety of means and SDs should perhaps not be taken
too seriously. Though the samples of the work of each examiner are not
believed to be biased in any significant way, they are also not strictly
random. The exigencies of the situation required that we take the first
hundred answer books that we could find to represent each examiner
(see Chapter VI). There is no reason to believe that they were essentially
different from the rest of the work of that examiner in any manner
significant to this study. However, we cannot assume that they exactly resemble
the mean and standard deviation of his whole batch. This fact does not
affect the differences between marking and re-marking, which are the
prime consideration of this study.

The signs attached to the row of Differences between the means are
arbitrary. When the second mean is higher, we have attached a “+”;
when the mean for re-examining is lower, we have attached a “-”.
This seems logical (or at least psychological!).

The same comments apply to the SDs (standard deviations). A “+”
means the second examiner widened out the distribution; a “-”
means that he attenuated, narrowed, or compressed it.

Note, however, that in a sense the “direction” is really irrelevant.
“First” and “second” examiners could easily be interchanged. There
are some consistent features of the “second” examiners: on the average
they have marked a little lower, and with less J-effect. However, the far
more important factor is the actual differences, regardless of “direction”.

Differences between means range up to 3.78 (which is 11% of
marks). Differences in SD range up to 1.04 (or 3% of marks).

For the differences between means, * indicates that the difference
is significant at the 5% level of confidence, ** that it is significant at
the 1% level of confidence, and *** that the difference is significant at
the 0.1% level of confidence.

The interpretation for the asterisks attached to the SD is the same,
except that, unfortunately, we had no tables for significance at the 0.1%
level of confidence for this statistic.

We may summarize these data as follows, giving the number of
differences which are significant at each level of confidence:

          .001    .01    .05    NS
M           9      1      4      6
SD          X      3      4     13

                                               Re-Examiner
                                               Same   Diff
Neither M nor SD differences are significant     3      1
Mean difference only is significant              6      3
SD difference only is significant                -      2
Both M and SD differences are significant        1      4

Remember that if the examiners are reliable (in the broad sense of
the term) then the differences should be not significant. We find four
cases (3 of self re-marking, only one with a different examiner) where
both of the differences were small enough to be attributed only to random
chance. Note, however, that this alone does not guarantee that both sets
of marks will be the same. The coefficient of correlation must also be
taken into account. Examiner G, for example, had no significant
difference in mean or SD between his two markings. But the correlation
between the two markings was only +.76, producing an average difference
of 2.00 between individual marks.

Sixteen out of the 20 pairs were significantly different in mean, SD,
or both, indicating that factors other than sheer chance were operating.

4. Distribution of Divisions

Those who do not understand means and standard deviations will
get a clear picture of what is happening when they look at the next few
rows. These show how many candidates were awarded each Division,
by each examiner. This is a good way of knowing the subjective standard
that the examiner has in mind when he marks the answer books.

In each major column, you will find two columns of figures in this
section. These show the number (out of 50) awarded each Division the
first and second time, i.e. by the Original and Re-examiner.

Let us first examine the cases where an examiner re-examines
his own answer books. Look at the seventh column, with D-D at the top.
Examiner D awarded two Seconds the first time, none the second; 31
Thirds the first time, only four the second. The first time, Examiner D
failed only 17 students; the second time he failed 46. Thus the second
time his mean was lower and his SD was narrower. Except for Examiners C
and G, there was a wide difference in the number awarded each Division
by the same examiner, the first and second time he marked his answer books.

The differences were even wider when it was a different examiner
who re-examined the answer books. For example, look at the fourth
column, Examiner A re-marking the answer books of Examiner B.
Where B awarded 11 Second Divisions, A has awarded only one. Where
B failed 10, A has failed 37, in the same batch. This is why B's mean is
so much higher than A's. In the second column, we see what happens
when B re-marks A's answer books. B awards two Firsts where A
awarded none, and nearly three times as many Seconds. But the number
of Fails remains about the same. This explains why B's SD was much
larger than A's. In both instances B marks much more leniently than
A does.

5. Pass Percentage

A popular (though very crude and unreliable) measure of “quality”
is the Pass Percentage. A glance along the next three rows shows how
little trust can really be placed in this figure. The first of these three
rows shows the percentage passed in the Original marking, the second
row the percentage passed in the Re-marking, and the third row the
difference between these two percentages.

Notice that even the same examiner, re-examining his own answer
books, can differ from himself by as much as 58% in Pass Percentage
(Examiner D). Only two examiners' Pass Percentages (Examiners C and
G) remained exactly the same, and Examiner E's was quite close. But
the remaining seven examiners showed wide changes.
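
The 58% figure for Examiner D can be recovered from the Division counts
quoted in the preceding section. A minimal sketch follows; the function
name and the dictionary layout are hypothetical.

```python
def pass_percentage(divisions: dict) -> float:
    """Percentage of a batch awarded any Division other than Fail."""
    total = sum(divisions.values())
    passed = total - divisions.get("Fail", 0)
    return 100 * passed / total


# Examiner D marking the same 50 answer books twice.
first_marking = {"I": 0, "II": 2, "III": 31, "Fail": 17}
second_marking = {"I": 0, "II": 0, "III": 4, "Fail": 46}

first_pass = pass_percentage(first_marking)
second_pass = pass_percentage(second_marking)
print(f"first: {first_pass:.0f}%  second: {second_pass:.0f}%  "
      f"difference: {first_pass - second_pass:.0f}%")
```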

Examiners B and I came fairly close to agreeing with A and X
(whose answer books they re-marked) in Pass Percentage. In Examiner
B's case this was sheer good luck: means, SDs and percentages in each
Division show that he agreed with A on nothing else. When Examiner
A re-marked the answer books of B, the difference in Pass Percentage
was 54%.

6. Examiner reliability

There is a wide variation among the correlations between each pair
of marks.

The self-correlations of examiners re-examining their own answer
books ranged from +.61 (Examiner B) to +.94 (Examiner E). The median
was +.81, which shows a fair degree of self-agreement. This compares
favourably with such studies done in other countries.

When Hindi examiners marked each other's answer books, the
product-moment coefficient of correlation ranged from +.66 (Examiners
A-B) to +.88 (X-E), with a median of +.78. The range (strangely) is
not so wide as that of the self-correlations, but the median is nearly the
same.

The various statistics do suggest that B may have been unusually
careless in his second marking. If this is so, this is the only clear case
of this among the 44 examiners in four subjects in this study.

But even a relatively high correlation of +.80 (B-A) does not look
so good when you examine the differences in means (3.8), SDs (.56),
and Pass Percentages (80% vs 26%). This clearly illustrates the fact that
the correlation coefficient alone is misleading, as it takes no account of
differences in M and SD.

7. Content reliability

The calculations for content reliability (see Chapter VII) were done
on the basis of the original marks awarded each batch of answer books.
(Note that the batches for A, B, C, D are 100 each, for E, F, G, H, I, J
are 50 each, and for Examiner X the number of answer books is 300.)

The fact that there is a wide variety of content reliabilities for batches
of answer books from the same examination needs some explanation.
In the first place, some of these differences may be due to different choices
of questions. However, probably more important is each examiner's style
or method of marking. Thus, two examiners may expect two different
kinds of answers (not completely different, of course, but different in
significant aspects) from the same question. For example, fresh and
original ideas may be highly prized by Examiner A as a sign of intelligence
and creativity, but strongly disliked by traditionalist B, who thinks they
show the student has not really understood the course. Thus the
candidate who happens to answer in a way which would have pleased Examiner
A gets a poor mark because, by chance, his examiner is not A but B.
In a sense, therefore, we do not really have a single examination; we
have as many examination papers as we have examiners. So students
sitting for the same examination are really sitting for different
examinations and, because the student knows nothing about his examiner,
he does not know which examination he is sitting for. This is in line with
Taylor's remark (which repeats in India similar studies done in the
West), that the candidate's mark depends more on the luck of who his
examiner is than on his actual merit.

Another factor may be the particular choice of options in a
particular examiner's batch of answer books. Harper's formula considers
all options to be equivalent, but if they are not, this could lead to
differences in the content reliability. In general, however, the options
chosen were the same in every batch of answer books.

Content reliabilities range from .45 (for B) to .77 (for X). But even
this highest value would be considered grossly inadequate in an objective
examination.

8. Marks reliability

As has been explained earlier (Chapter VII), the only fair
comparison between traditional and objective examinations is on the basis
of what we have termed “marks reliability”. This is the product of
examiner and content reliability. We have calculated this only in the cases
where a different examiner has re-marked the answer books. “Marks
reliability” attempts to estimate the correlation between the marks of a
batch of candidates who have answered examination I marked by
Examiner P, and examination II marked by Examiner Q.

Marks reliabilities range from a low of .36 to a high of .63. Even
this highest reliability is not high enough to estimate, accurately, the
relative accomplishment of a class or group in two different subjects,
let alone estimate the true merit of any individual.

Of course, the reliability of the Hindi mark based on three papers
is somewhat higher than the reliability of a single paper. However, the
difference is not enough to justify the heavy reliance placed on Hindi
marks.

In a sentence: Hindi marks are unreliable.

9. Standard error of measurement (SEM)

The SEM ranges from 2.64 to 1.89 marks out of 34. (To compare
this with other subjects, the range is 3.88 to 2.78 marks out of 50.) This
means that even for the most accurate batch, there are only two chances
in three that a given candidate's mark will be within 2.6 out of 34 (3.9
out of 50) of his “true” mark. For about 5 candidates in every 100,
the error will be more than 11% of marks, even for this best batch.
But, as has been pointed out, our concern should be for the largest (not 
the smallest) errors that can occur. A SEM of 2.64 means that for one 
candidate in each hundred (or 1,000 in every lakh), the error will be 20% 
of marks or more. And whether the mark awarded will be 20% above 
or below the “true” mark is a matter of sheer chance. 

But note that these estimates of errors take into account only the 
correlation — not the differences in M and SD. Thus they would be true 
only if the marks of all examiners were statistically scaled to a common 
standard. More exactly, each SEM shows the range of error which would 
exist if the re-marking by the second examiner were scaled to the mean 
and standard deviation of the original marking of the original examiner. 
The actual range of error, under current unscaled examination marking
methods, would probably be much greater (see discussion in Chapter VII).
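
Scaling one examiner's marks to the mean and standard deviation of another is a linear transformation. A minimal sketch follows, using invented marks for five answer books. `scale_to` is a hypothetical helper, not a procedure described in this report.

```python
from statistics import mean, pstdev


def scale_to(marks, target_mean, target_sd):
    """Linearly rescale `marks` so that they have the given mean and SD."""
    m, s = mean(marks), pstdev(marks)
    if s == 0:
        raise ValueError("cannot scale marks that are all identical")
    return [target_mean + (x - m) * target_sd / s for x in marks]


# Invented marks for the same five answer books.
original = [8, 10, 12, 14, 16]
remarked = [5, 6, 9, 10, 10]  # a stricter second examiner with a narrower spread

scaled = scale_to(remarked, mean(original), pstdev(original))
print([round(x, 1) for x in scaled])
```

The rank order of the candidates is unchanged by the scaling, so the correlation between the two markings stays the same. Only the differences in M and SD are removed.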

10. Differences 

The last section of Table 12.2 presents the frequency with which 
each difference between examiners of a single answer book appeared. 
Note that this does not take into account the content reliability (as the 
SEM does), but only the actual difference between two markers of the 
same answer book. If each candidate had answered two parallel examina- 
tions, marked by different examiners, these differences would be much 
larger. 

The largest difference for the same examiner re-examining 50 of his 
own answer books was 8 marks. (This is equivalent to 12 out of 50, the 
same as biology and slightly higher than history.) There were two such 
differences. The details of one of them (Examiner B) are as follows: 



              First        Second
Question      Marking      Marking
   1            6            3
   2            1½           1
   3            2½           2½
   4            3            2½
   5            3            1
   6            3            ½
              ----         ----
               19           11

Notice that this candidate is dropped from the highest mark in the
Second Division (19) to the lowest mark in the Third (11). You will
remember from Table 4.3 and Figure 4.1 that several Hindi candidates were
changed two Divisions from first to second marking, although the change
in number of marks was not as much as above. (A change from Fail to
II requires only a change of 5 marks, from 10 to 15.)

The largest difference for a pair of examiners was between Examiners
D and C, and was 9 marks out of 34 (equivalent to 13 marks out of 50).
The details are as follows:


              First          Second
Question      Marking        Marking
   1            2              0
   2            1½             0
   3          [illegible]      ½
   4          [illegible]      ½
   5            1              1
   6          [illegible]    [illegible]
              ----           ----
              10½ = 11         2

Since this largest difference occurred once in 500 answer books,
the probability is that there would be “errors” of 9 or more for at least
200 out of every lakh of candidates. Slightly smaller errors would be
very much more common (see Table 4.1).

Note that by chance the two examples chosen are both examples
of the second examiner lowering the marks. In history it was the other
way around. Changes in both directions are nearly equally common.

It is also instructive to look at a case, chosen nearly at random,
of “agreement” between two examiners:


              First        Second
Question      Marking      Marking
   1            3            0
   2            1½           0
   3            1            ½
   4            1            0
   5            1            1
   6            0            0
              ----         ----
              = 8          = 8

We will let the reader draw his own conclusions from these marks.

The average differences are reported at the bottom of each column
in Table 12.2. The smallest is 1.18 (Examiners C and H) and the largest
3.78 (Examiner A re-marking B). Thus the average differences range from
3.5% to 11% of marks. But talking of averages reminds us of the
student who was refused a freeship because the “average income of your
neighbourhood is over five lakhs”. Well, there were two houses on the
street. One was owned by a millionaire, the other by a man who earned
only Rs. 500 per year. But their average income of Rs. 5,00,250 debarred
both their sons from freeships in the university.

So it is the largest differences, not the averages, which should
concern us. Even the Divisions of 36% of the candidates were different
in the two markings. There is little comfort in the “average”, when
the examination can be so very unjust to so many.
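
The point about averages can be checked with a few lines of arithmetic. In this sketch the millionaire's income of Rs. 10,00,000 is inferred from the quoted average, the two lists of marks are invented for illustration, and the function names are ours.

```python
def mean_and_max_difference(first, second):
    """Average and largest absolute difference between two markings."""
    diffs = [abs(a - b) for a, b in zip(first, second)]
    return sum(diffs) / len(diffs), max(diffs)


def as_percent(marks, max_marks=34):
    """Express a number of marks as a percentage of the maximum."""
    return 100 * marks / max_marks


# The two houses on the street (rupees per year).
incomes = [1_000_000, 500]
average_income = sum(incomes) / len(incomes)
print(average_income)

# Invented marks: four answer books nearly agree, one differs by 9.
first_marking = [12, 15, 9, 20, 11]
second_marking = [11, 15, 10, 11, 12]
print(mean_and_max_difference(first_marking, second_marking))

# The smallest and largest average differences quoted for Hindi.
print(round(as_percent(1.18), 1), round(as_percent(3.78), 1))
```

An average difference of 2.4 marks in the invented example hides one candidate whose mark changed by 9, which is the sense in which the average gives little comfort.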

11. Scatter charts

Table 12.3 compares the work of Examiners A and B on their own
and on each other’s answer books. It also compares the re-marking of
Examiners C and D with each other.

These and Table 12.4 are the individual scatter charts which are
summarized in Table 4.3.

For example, look at the first chart in Table 12.3, showing what
happened when Examiner A re-marked his own 50 answer books. Out
of 26 that he awarded a Third Division the first time (line 3 and column T,
which means Total), the second time he reduced 1 of them to a Fail and
raised 4 to a Second. Conversely, reading downward, of the 11 who were
given a Second Division the second time, only 4 had originally been
Seconds. One had been a First Divisioner, 4 Third Divisioners, and 2
had failed the first time A marked these answer books.

The little dashes (-) indicate zero cases at the points which would
show perfect agreement on Division. They are to help the eye to find the
diagonal of perfect agreement in each chart.

At the bottom of the chart we also find the words, “Agree - 64%”.
This refers to the fact that 64% of the candidates received the same Class
or Division in both markings by Examiner A.

The remaining charts may be read in the same way. Note that the
top row of four charts in Table 12.3 are the self-agreement of the four
examiners; the bottom row are their agreements with each other.

The left half of Table 12.4 shows the self-correlation of Examiners
E, F, G, H, I, J. Note that even Examiner E, who had the highest
self-agreement (+.936), spread both his I and II candidates over three
Divisions the second time he marked them.

Percentage agreement and r are, of course, different measures of
agreement. One reason why they are not closely related is that a perfect
correlation of marks can exist, even if the Divisions are entirely
different. Another reason is that the Division categories are of unequal
size: “Fail” covers 11 marks (including 0), “III” only 4 marks.

(Footnote to the scatter charts: a difference of .40 is significant at the
5% level of confidence; a difference of .53 is significant at the 1% level
of confidence.)
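
A small example shows how a perfect correlation can coexist with complete disagreement on Division. The Division boundaries used here (Fail 0-10, III 11-14, II 15-19, I 20 and above, out of 34) are inferred from the remarks in this chapter and should be treated as an assumption. The marks are invented and the function names are ours.

```python
from math import sqrt


def division(mark):
    """Division for a Hindi mark out of 34 (boundaries are an assumption)."""
    if mark >= 20:
        return "I"
    if mark >= 15:
        return "II"
    if mark >= 11:
        return "III"
    return "Fail"


def percent_agreement(first, second):
    """Percentage of candidates given the same Division in both markings."""
    same = sum(division(a) == division(b) for a, b in zip(first, second))
    return 100 * same / len(first)


def pearson_r(x, y):
    """Product-moment coefficient of correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented case: the second examiner gives every candidate 4 marks more.
first = [7, 8, 9, 10]              # all Fail
second = [m + 4 for m in first]    # all Third Division

print(pearson_r(first, second))          # perfect correlation
print(percent_agreement(first, second))  # no agreement on Division
```

The narrow Third Division (only 4 marks wide) makes this kind of disagreement easy to produce, which is the second reason given in the text.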

The right half of Table 12.4 shows what happened to the six equivalent
[...] to make the means and SDs equal [...] at the bottom of each scatter
chart [...] of them with each other. (They are lined up to make
comparison easy.) Examiner E awarded 14 Second Divisions [...]. Note in
the left-hand column that [...] equally strict with their own answer
books the second time. Examiner J failed only 10 [...] 25 of their
virtually identical batches of answer books. [...] must depend as much or
more on the examiner as on the [...].

Comparing the two halves of Table 12.4, note that four examiners,
G, H, [...] had higher correlations with Examiner X than with themselves.
This is also true of four of the eight comparisons in Table 12.3. To find
out if these differences were significant, all r’s were translated into
Fisher’s z, and the differences calculated. The last two lines of the
right-hand column of Table 12.4 give the results. None of the differences,
positive or negative, is significant. All of the Hindi examiners agreed
equally well whether marking their own or someone else’s answer books.
([...] whereas for history and [...] there are wide differences.)
The reader may want to speculate on the reasons why Hindi examiners do
not agree with themselves any better than they agree with [...] styles of
marking rather than to any consistent [...].

Examiner E shows [...] per cent [...] himself [...] but a lower
correlation [...] with Examiner [...]. For Examiners G, H, [...] the
reverse is [...]. [...] indicates that per cent agreement and r are by no
means equivalent measures of agreement.
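
The comparison of two correlations through Fisher's z can be sketched as follows. We assume the usual large-sample standard error, sqrt(1/(n1-3) + 1/(n2-3)), and batches of 50 answer books. The two correlations in the example are invented, and the function names are ours. Under these assumptions the critical differences come out close to the 40 and 53 (that is, .40 and .53) quoted in the footnote to the scatter charts.

```python
from math import atanh, sqrt
from statistics import NormalDist


def fisher_z_difference(r1, r2):
    """Difference between two correlations after Fisher's z transformation."""
    return atanh(r1) - atanh(r2)


def critical_difference(n1, n2, alpha):
    """Smallest z difference significant at the two-sided level `alpha`."""
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return NormalDist().inv_cdf(1 - alpha / 2) * standard_error


# Two batches of 50 answer books each.
crit_5 = critical_difference(50, 50, 0.05)
crit_1 = critical_difference(50, 50, 0.01)
print(round(crit_5, 2), round(crit_1, 2))

# Invented example: self-correlation .85, correlation with another examiner .80.
diff = fisher_z_difference(0.85, 0.80)
print(round(diff, 2), abs(diff) >= crit_5)
```

A difference as small as the invented one falls well short of either threshold, which is the kind of result the text reports for all of the Hindi comparisons.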

12. Self-consistency of examiners' distributions

We have already discussed the product-moment correlation as a 
measure of self-consistency. Another measure is the extent to which 
examiners tend to agree with themselves in M, SD, and percentages of 
marks in each Division. 

We get a more critical picture of this when we examine each pair
of examiners. Take, for example, Examiners A and B. Remember the
experimental design: The hundred answer books originally marked by
Examiner A were split into two equivalent batches of 50 each. Similarly,
B’s 100 answer books were split into two equivalent batches. Fifty of
Examiner A’s and 50 of B’s answer books were then mixed up in a random 
order, and sent to Examiner A for re-marking. Similarly, B received a 
mixed batch of both his own and A’s answer books. The examiners were 
told only that “some of the books” would be answer books originally 
marked by themselves, and some originally marked by others. In response 
to a questionnaire, examiners assured us that they were generally not 
aware which books were which. 

Here is the comparison of A and B (from Table 12.2):

             Mean                 SD
          A        B          A        B
        11.8     13.4       2.82     3.79
         8.4     12.2       3.29     2.74

In the first row, the 50 answer books to which A originally awarded 
mean marks of 11.8 were awarded mean marks of 13.4 by Examiner B. 
The second row must be read backwards. The batch to which B originally 
awarded a mean of 12.2, was awarded mean marks of 8.4 by A. Thus 
for both batches, B’s mean is consistently higher than A’s. However, no 
such consistency shows in the SDs. 

Here is the same comparison for C and D, who were also a matched
pair in the experimental design:

             Mean                 SD
          C        D          C        D
        10.2      9.7       2.86     2.95
         8.6     10.7       3.00     2.73

Again, as we also saw in the scatter charts, there is a lack of consistency.
For more detail, let us compare the six examiners who were matched
against nearly identical batches of Examiner X's marking:


              Means when re-marking       SDs when re-marking
Examiner        Self          X             Self          X
   E          13.7 (2)     12.3 (2)       3.99 (1)      3.44 (2)
   F          10.7 (4)     10.7 (6)       3.03 (2.5)    2.67 (6)
   G          14.4 (1)     12.0 (3)       3.03 (2.5)    2.75 (5)
   H           9.6 (6)     11.6 (5)       2.74 (4)      3.23 (4)
   I          11.4 (3)     11.8 (4)       2.65 (5)      3.54 (1)
   J          10.5 (5)     13.2 (1)       2.38 (6)      3.31 (3)


TABLE 12.4

Left half: Self-agreement of Six Examiners in Hindi
Right half: Marks Awarded by Six Hindi Examiners to Approximately
Equivalent Samples of Examiner X’s Candidates

The rank orders are given in parentheses beside each number. Notice
the almost complete lack of consistency. The rank order correlations
are +.26 for the means and -.27 for the SDs, neither of which is
significantly different from zero.
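
The two rank order correlations can be reproduced from the ranks in parentheses. This sketch uses the simple formula 1 - 6Σd²/(n(n²-1)), which is only approximate when ranks are tied, and reads the rank of the 3.23 entry (Examiner H, re-marking X) as 4, as its position among the six values implies. The function name is ours.

```python
def rank_order_correlation(ranks_a, ranks_b):
    """Spearman's rank order correlation from two lists of ranks."""
    n = len(ranks_a)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Ranks for Examiners E, F, G, H, I, J, in that order.
mean_ranks_self = [2, 4, 1, 6, 3, 5]
mean_ranks_x = [2, 6, 3, 5, 4, 1]
sd_ranks_self = [1, 2.5, 2.5, 4, 5, 6]
sd_ranks_x = [2, 6, 5, 4, 1, 3]

rho_means = rank_order_correlation(mean_ranks_self, mean_ranks_x)
rho_sds = rank_order_correlation(sd_ranks_self, sd_ranks_x)
print(round(rho_means, 2), round(rho_sds, 2))
```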

In history we found consistent differences in “standard” of marking
between various examiners. In Hindi, apparently, the same examiner
may be lenient one time and strict the next, or vice versa. The differences
between the two markings seem to be due more to the operation of random
chance than to any consistent characteristics of individual examiners.
If this is so, “scaling” would be even more effective in Hindi than in
history in wiping out some of the inequities of the examination. [...]
wise for the [...] distributions which they expect, for average or typical
[...]. (This is not a radical suggestion. It is, in fact, common practice
in [...] countries.)

13. Item analysis

One of the most [...] carry out is an item analysis. At [...] work.
Merely select a random group of answer books (say 100), arrange
them in order of merit, divide them into [...] groups, and then tabulate
how well each of these groups did on each question.

We divided our thousand Hindi answer books into five groups of
200 answer books each, on the basis of [...] marking. Table 12.5
presents the results.

We had agreed to keep secret the [...] of the Board supplying
these data. Unfortunately, this precludes publishing the Hindi question
paper. However, its organization can be [...], as it is fairly typical.

There were 6 questions, all compulsory, [...] allowed within
each question as follows:

                                                           Maximum Marks
Q. 1  Answer 1 out of 11, each of which had 2 alternatives      10 marks
Q. 2  Answer 1 out of 9                                          5 marks
Q. 3  Answer 1 out of 4                                          5 marks
Q. 4  Answer 1 out of 13                                         5 marks
Q. 5  Two compulsory parts, each of which has
      a choice of 1 from 2 alternatives                 2+2 =    4 marks
Q. 6  Compulsory (no alternatives)                               5 marks
                                                                --
                                                                34

Table 12.5 gives the number of candidates of each ability level
answering each question and their marks on that question. For example,
all 200 in the upper group answered Question 1, and their average marks
were 4.92. The average marks of the top group are given at the extreme
right, as 16.03.

Only one question, Question 4, shows serious non-linearity. Compare
the mean marks on this question for the third and fourth groups,
with their total means:

                  Question 4      Total
Third group          1.79         11.26
Fourth group         1.86         10.00

In other words, poorer students did better on this question than
better students.
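
The procedure can be sketched in a few lines. This is a minimal illustration under our own assumptions: each candidate is a list of marks per question, `None` stands for a question not attempted, and the function names are hypothetical. The second function flags any point at which a weaker group outscored the next stronger one, as happens with the Question 4 means of Table 12.5.

```python
def group_means(records, n_groups):
    """Mean marks per question for each ability group, strongest first.

    Candidates are ranked by total marks. Means are taken over those who
    attempted the question. Candidates left over when the number is not
    divisible by `n_groups` are ignored.
    """
    ranked = sorted(records, key=lambda r: sum(m or 0 for m in r), reverse=True)
    size = len(ranked) // n_groups
    rows = []
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size]
        row = []
        for q in range(len(group[0])):
            answered = [c[q] for c in group if c[q] is not None]
            row.append(sum(answered) / len(answered) if answered else None)
        rows.append(row)
    return rows


def reversals(means_by_group):
    """Indices i where group i+1 (weaker) outscored group i (stronger)."""
    return [i for i in range(len(means_by_group) - 1)
            if means_by_group[i + 1] > means_by_group[i]]


# Tiny invented example: four candidates, two questions, two groups.
records = [[5, 3], [4, None], [2, 1], [1, 2]]
print(group_means(records, 2))

# Question 4 means for the five groups of Table 12.5, top group first.
question_4 = [2.68, 2.16, 1.79, 1.86, 1.19]
print(reversals(question_4))  # index 2: the fourth group beat the third
```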

Note, also, some inconsistencies in the number of students not
answering Question 4 in the various ability groups.

The questions differ in difficulty. However, in this case this is not
“bad”. When all are compulsory, it does not matter if one differs from
another, for the differences affect all candidates equally.


TABLE 12.5

Item Analysis of Hindi Higher Secondary Examination (For Description of
Examination see Text)

Each entry is the mean marks of the group on the question, followed in
parentheses by the number in the group choosing the question. Groups run
from the highest (Group 1) to the lowest (Group 5).

Question Number       1         2         3         4         5         6       Mean
Max. Marks           10         5         5         5         4         5       Marks

Group 1             4.92      2.14      2.37      2.68      1.84      2.18     16.03
                   (200)     (200)     (200)     (197)     (194)     (199)     (200)

Group 2             4.11      1.65      1.98      2.16      1.24      1.87     12.98
                   (200)     (199)     (200)     (200)     (195)     (200)     (200)

Group 3             3.97      1.47      1.65      1.79       .82      1.65     11.26
                   (200)     (199)     (200)     (193)     (195)     (199)     (200)

Group 4             3.27      1.33      1.50      1.86       .73      1.55     10.00
                   (200)     (191)     (199)     (196)     (190)     [...]     [...]

Group 5             1.82       .85      1.12      1.19       .39       .97      6.03
                   (200)     (185)     (195)     (176)     (191)     (194)     (200)

We wonder, though, if the reader has noticed a serious flaw in this
item analysis? This is based only on the parts, not on the original separate
items. If the reader will look at the outline of the question paper above,
he will find that there were actually 41 separate questions in this
examination. For a true item analysis, we should analyse each of these 41
items separately. We should then try to find out whether, within each
Question, the various alternatives have equal value. (See mathematics for
an example of this.) As noted earlier, item analysis was a by-product, not
a central concern of this study. To do a detailed item analysis for Hindi
would have involved a great deal of work checking which item each candidate
had answered, as there was no indication of this on the cover of the
answer book (from which most data were taken). It also would have
involved the use of a much larger number of IBM cards, and a more
complex computer design. For both practical and financial reasons,
this was not done.

So perhaps the major value of the item analysis presented here is
to show that it is not particularly valuable to do an analysis by sections
only, rather than by the actual items separately chosen.


Detailed Statistics - Biology


FOR BIOLOGY we used the zoology paper, which consisted of eight
questions, from which each candidate had to select five. Six 
questions had two parts each, with varying marks assigned to each 
part. If the candidate chose the eighth question, he had to answer four 
out of five parts. Each question carried ten marks, making a maximum 
of 50 marks for the paper. (For details see “Item Analysis”, at the end of 
the Chapter.) 

1. Distribution of marks 

Table 13.1 gives the frequency distribution for each batch of 50 
answer books, in biology. These are the detailed frequencies on which 
the totals given in Table 4.5 are based. 

Table 13.1-A shows the frequency distributions for the first marking 
of each batch of answer books. The top row names the original examiner 
of each batch. It is his marks which are distributed in this part of the 
Table. 

The hundred answer books of Examiner A were split into two
equivalent batches of 50 answer books each. One of these batches was
to be re-examined by Examiner A himself, the other by Examiner B, as
shown in the second row of the Table. The answer books of Examiners
B, C, and D were treated in the same way. The reader can see, from this
Table, that the batches were as closely matched as was reasonably
possible. Some adjustments had to be made to make the means and SDs equal.

The 300 answer books of Examiner X were split into 6 equivalent
batches. The matching here, also, is as close as is feasible.

Table 13.1-B shows the marks awarded each batch of 50 answer
books, the second time they were marked. Compare, for example, the
first pair of columns. The two batches of Examiner A’s work, which were
closely matched in the first part of the Table, are now awarded quite
different distributions of marks by Examiners A and B.

The lines across the Tables show the divisions between the Classes
or Divisions. Notice the striking J-effect in the first marking (13.1-A).
A large number of candidates (70) were awarded 17 marks (passing),
but hardly any (8) were awarded a failing 16. In re-marking (13.1-B)
this J-effect is greatly reduced, but by no means disappears. Notice that
only Examiner C is completely consistent in using the J-effect both the
first and second time with all four batches of 50 answer sheets that he
marked. (Examiners are referred to the footnote near the beginning
of the Chapter on Hindi, for a good rationalization against the sort of
educational brinkmanship represented by this tendency to raise
borderline candidates by one mark.)

There is also a pronounced J-effect between the First and Second
Divisions in the first marking. In the second marking, this is reversed,
with the extra cases being piled up just below rather than just above
the line. (In history and Hindi, the J merely disappeared.) The reverse-J
noted between Second and Third Divisions in history and Hindi does not
show up here. Apparently, there is no special reluctance on the part of
biology examiners to award a Second Class.

Note that only 34 of the 51 points (50+0) in the “scale” are actually
used. The marks from 0-3 and from 38-50 are wasted; they might just
as well not exist. One is inclined to think that the bottom marks may
occasionally be used, but the top ones we know are certainly wasted.

2. Table 13.2

Table 13.2 summarizes most of the statistical tables produced by the
IBM “1620” computer. The summary means and medians of these
statistics have already been reported in Table 10.1.

The top row tells who originally examined the batch of 50 Biology
answer books which provided the statistics in the column. The second
row tells who examined the answer books the second time. The
third row merely notes (for easier reference) whether the answer books
were re-examined by the Same examiner, or a Different one from the
original.

In the remainder of the Table, “O” refers to the Original marking,
and “R” to the Re-examining.

3. Mean and standard deviation

The wide variety of means and SDs should perhaps not be taken 
too seriously. Though the samples of the work of each examiner are 
not believed to be biased in any significant way, they are also not 
strictly “random”. The exigencies of the situation required that we 
take the first hundred answer books that we could find to represent 
each examiner (see Chapter VI). There is no reason to believe that they 
were essentially different from the rest of the work of that examiner 
in any manner significant to this study. However, we cannot assume that 
they exactly resemble the mean and standard deviation of this whole 
batch. This fact does not affect the differences between marking and 
re-marking, which are the prime consideration of this study. 

The signs attached to the row of Differences between the means are
arbitrary. When the second mean is higher, we have attached a “+”;
when the mean for re-examining is lower, we have attached a “-”.
This seems logical (or at least psychological!).

The same comments apply to the SDs (standard deviations).
A “+” means the second examiner widened out the distribution, a “-”
means that he attenuated it.

Note, however, that in a sense the “direction” is really irrelevant.
“First” and “second” examiners could easily be interchanged. There
are some consistent features of the “second” examiners: on the average
they have marked a little lower, and with less J-effect. However, the
far more important factor is the actual differences, regardless of
“direction”.

Differences between means range up to 4.64 (which is 9% of marks),
which occurs, surprisingly, when the same examiner re-marks his own
answer books. Differences in SD range up to 2.90 (or nearly 6% of
marks).

For the differences between means, * indicates that the difference
is significant at the 5% level of confidence; ** that it is significant at the
1% level of confidence; and *** that the difference is significant at the
0.1% level of confidence.

The interpretation for the asterisks attached to the SDs is the
same except that, unfortunately, we had no tables for significance at the
0.1% level of confidence for this statistic.

We may summarize these data as follows, giving the number of
differences which are significant at each level of confidence:

            .001      .01      .05      NS
M             6        2      [...]     10
SD           XX      [...]    [...]     16



                                                    Re-Examiner
                                                   Same     Diff.
Neither M nor SD differences are significant         3        4
Mean difference only is significant                [...]      4
SD difference only is significant                    1      [...]
Both M and SD differences are significant            1      [...]


Remember that if the examiners are reliable (in the broad sense of
the term), then the differences should not be significant. We find seven
cases (3 of self re-marking, 4 with a different examiner) where both
the differences were small enough to be attributed only to random chance.
Note, however, that this alone does not guarantee that both sets of marks
will be the same. The coefficient of correlation must also be taken into
account. When Examiner B re-marked the answer books of Examiner A,
for example, they agreed fairly closely on mean and SD, but their
correlation was only +.73, producing an average difference of 3.24
between individual marks.

Thirteen out of the 20 pairs were significantly different in means,
SD, or both, indicating that factors other than sheer chance were
operating.

4. Distribution of Divisions

Those who do not understand means and standard deviations will
get a clear picture of what is happening when they look at the next
few rows. These show how many candidates were awarded each
Division, by each examiner. This is a good way of knowing the subjective
standard that the examiner has in mind when he marks the answer books.

In each major column, you will find two columns of figures in
this section. These show the number (out of 50) awarded each Division
the first and second time, i.e. by the Original and Re-examiner.

Let us first examine the cases where an examiner re-marks his
own answer books. Look at the seventh column with D D at the top.
Examiner D awarded 11 First Divisions the first time, and none the
second time. On the other hand, his Third Divisioners increased from
10 to 21, and Fails from 2 to 7. Thus the second time his mean is lower,
and his SD is also somewhat narrower. Except for Examiners A, H, I,
and J, there was a wide difference in the number awarded each Division
by the same examiner, the first and second time he marked his answer
books.

The differences were even wider when it was a different examiner
who re-examined the answer books. For example, look at the fourth
column, Examiner A re-marking the answer books of Examiner B. Though
they agreed on the number of First and Third Divisioners, Examiner B
awarded twice as many Seconds as A, while A failed more than three
times as many candidates as B, in the same batch. This is why A’s mean
is so much lower than B’s. Oddly enough, when B re-marked A’s batch
(column two), they agreed fairly closely except in the First Division.
But notice that B tends to have a narrower spread (a smaller SD) than
A in both instances.


5. Pass Percentage

A popular (though very crude and unreliable) measure of “quality”
is the Pass Percentage. A glance along the next three rows shows how
little trust can really be placed in this figure. The first of these three rows
shows the percentage passed in the Original marking, the second row
the percentage passed in the Re-marking, and the third row the difference
between these two percentages.

Notice that even the same examiner, re-examining his own answer
books, can differ from himself by as much as 18% in Pass Percentage
(Examiner C). While this is a much smaller difference than we found
in history and Hindi, it is still not negligible. No examiners’ Pass
Percentages remained exactly the same, although A’s, B’s, and G’s came
quite close.

Examiners X and H, surprisingly enough, agreed perfectly in Pass
Percentage, though we note from the rest of the table that they differed
significantly in mean marks, and had a rather low r. But other Pass
Percentage differences went as high as 24% (Examiners B and A),
though when it was B re-marking A’s answer books, instead of vice
versa, the difference was only 4%. This latter fact again shows the lack of
self-consistency of examiners.

6. Examiner reliability

There is a wide variation among the coefficients of correlation
between each pair of marks.

The self-correlations of examiners re-examining their own answer
books ranged from +.75 (Examiner D) to +.91 (Examiner J). The
median of +.83 shows a fair degree of self-agreement. This compares
favourably with such studies done in other countries.

When biology examiners marked each other’s answer books, the
product-moment coefficient of correlation ranged from +.67 (Examiners
D-C) to +.83 (Examiners X-G), with a median of +.80.

But even a relatively high correlation of +.89 (Examiner C) does
not look so good when you examine the differences in means (2.3), SDs
(.95), and Pass Percentages (10% vs 28%). This clearly illustrates the
fact that the correlation coefficient alone is misleading, as it takes no
account of differences in M and SD.

7. Content reliability

The calculations for content reliability (see Chapter VII) were
done on the basis of the original marks awarded each batch of answer
books. (Note that the batches for A, B, C, D are 100 each, for E, F, G,
H, I, J are 50 each, and for Examiner X the number of answer books is
300.)

The fact that there is a wide variety of content reliabilities for batches
of answer books from the same examination needs some explanation.
In the first place, some of these differences may be due to different
choices of questions. However, probably more important is each
examiner’s style or method of marking. Thus two examiners may expect two
different kinds of answers (not completely different, of course, but
different in significant aspects) from the same question. In a sense,
therefore, we do not really have a single examination; we have as many
examination papers as we have examiners. So students sitting for the
same examination are really sitting for different examinations, and,
because the student knows nothing about his examiner, he does not
know which examination he is sitting for. This is in line with Taylor’s
remark (which repeats in India similar studies done in the West) that
the candidate’s mark depends more on the luck of who his examiner is
than on his actual merit.

Content reliabilities range from .41 (for F) to .68 (for J) with a
median of .63. But even this highest value of .68 would be considered
grossly inadequate in an objective examination.


8. Marks reliability

As has been explained earlier (Chapter VII), the only fair comparison
between traditional and objective examinations is on the basis of what
we have termed “marks reliability”. This is the product of examiner
and content reliability. We have calculated this only in the cases where
a different examiner has re-examined the answer books. “Marks reliability”
attempts to estimate the correlation between the marks of a batch of
candidates who have answered examination I marked by Examiner P,
and examination II marked by Examiner Q.
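
As a rough check on the definition, the medians quoted in this chapter can be multiplied together. The product of two medians need not equal the median of the products, so this is only an illustration. The function name is ours.

```python
def marks_reliability(examiner_reliability, content_reliability):
    """Marks reliability as the product of examiner and content reliability."""
    return examiner_reliability * content_reliability


# Medians quoted for biology: +.80 between different examiners,
# .63 for content reliability.
estimate = marks_reliability(0.80, 0.63)
print(round(estimate, 2))
```

The result sits at the median marks reliability reported for biology, and it shows how two moderately high coefficients combine into a much lower one.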

Marks reliabilities range from a low of .42 to a high of .56 with
a median of .50. Note that the range is narrower and the extremes are
neither as low nor as high as were found in history and Hindi. Apparently
the best biology examinations are not so good as the best history and
Hindi exams, but the worst are not nearly so bad, either.

But even the highest reliability of .56 is not high enough to estimate, 
accurately, the relative accomplishment of a class or group in two different 
subjects — let alone estimate the true merit of any individual. 

Of course, the reliability of the biology mark based on two papers 
is somewhat higher than the reliability of a single paper. However, the 
difference is not enough to justify the heavy reliance placed on biology
marks.

In a sentence: Biology marks are unreliable. 

9. Standard error of measurement (SEM) 

The SEM ranges from 4.1 to 3.0 marks out of 50. This means that 
even for the most accurate batch, there are only two chances in three 
that a given candidate’s mark will be within 3 marks of his “true” mark. 
For about 5 candidates in every 100, the error will be more than 12%, 
for this best batch. But, as has been pointed out, our concern should be 
for the largest (not the smallest) errors that can occur. An SEM of 4.1 
marks means that for one candidate in each hundred (or 1,000 in every
lakh), the error will be 21% of marks or more. And whether the mark 
awarded will be 21% above or 21% below the “true” mark is a matter 
of sheer chance. 

But note that these estimates of errors take into account only the 
correlation— not the differences in M and SD. Thus they would be true 
only if the marks of all examiners were statistically scaled to a common 
standard. More exactly, each SEM shows the range of error which 
would exist if the re-marking by the second examiner were scaled to the 
mean and standard deviation of the original marking of the original 
examiner. The actual range of error, under current unscaled examination
marking methods, would probably be much greater (see discussion
in Chapter VII).

10. Differences 

The last section of Table 13.2 presents the frequency with which 
each difference between examiners of a single answer book appeared. 
Note that this does not take into account the content reliability (as the 
SEM does), but only the actual difference between two markers of the 
same answer book. If each candidate had answered two parallel examina- 
tions, marked by different examiners, these differences would be much 
larger. 

The largest difference for the same examiner re-examining 50 of his
own answer books was 12 marks. This was a difference in the two markings
of one answer book by Examiner D. The details are as follows:

              First        Second
Question      Marking      Marking
   1            5½           2
   2            6            3½
   4            6            4
   7            6½           4
   8            6            4
              ----         ----
               30          17½ = 18

The same examiner first awarded this candidate a First Division,
and then a Third Division!

The largest difference between a pair of examiners was also 12
marks. It was between Examiners D and C, as follows:


Question       First        Second
              Marking       Marking

  [?]           6             5½
   3            8             6
   6            5½            2½
   7            7             6
   8            7             2
               ----          ----
               33½           22


It is interesting to note that the widest divergence was on Question 8,
in which the candidate had to choose four out of five parts. The marks
awarded by the two examiners were as follows:

    1½    1½    2    2   =  7
    1     1     0    0   =  2

This largest difference of 12 marks occurred twice in 1,000 answer
books. Thus the probability is that there would be “errors” of 12 or
more for at least 200 out of every lakh of candidates. Slightly smaller
errors would be very much more common (see Table 4.1).

Note that by chance the two examples chosen were both examples
of the second examiner lowering the marks. In history it was the other
way around. Changes in both directions are nearly equally common.

It is also instructive to look at a case, chosen nearly at random,
of “agreement” between two examiners:

Question       First        Second
              Marking       Marking

   1            4½            3½
   2            2             3
   3            4½            3½
   7            5             4½
   8            3½            5
               ----          ----
               19½ = 20      19½ = 20

We will let the reader draw his own conclusions from these marks.

The average differences are reported at the bottom of each column
in Table 13.2. The smallest is 1.70 (Examiners X and G), and the largest
is 4.92 (Examiner D re-marking his own answer books). It is amusing
(or is it tragic?) that both the largest difference between means (4.64),
and the largest mean difference (4.92) were for biologists re-examining
their own answer books, not someone else’s! Also, the smallest mean
difference (1.70) was between two different biologists, rather than one
re-marking his own answer books.

The average differences thus range from 3.4% to nearly 10% of 
marks. This is approximately the same as found for history and Hindi. 
However, many people have died of heat stroke in cities whose average 
temperature is only 70° F. So it is the largest differences between
examiners that are possible, that should concern us. Even the Divisions of
38% of the biology candidates were different in the two markings. There 
is little comfort in the “average”, when the examination can be so very 
unjust to so many. 

11. Scatter charts 

Table 13.3 compares the work of Examiners A and B on their own 
and on each other’s answer books. It also compares the re-marking of 
Examiners C and D with each other. 

These and Table 13.4 are the individual scatter charts which are 
summarised in Table 4.3. 

For example, in Table 13.3 look at the third chart in the top row. 
This shows what happened when Examiner C re-marked his own 50 
answer books. The first time he awarded 4 First Divisions (line 1 and 
column T, which means Total) — but in the second marking, all 4 of them 
were reduced to Second Class. On the other hand, out of 22 Seconds in 
the first marking, 2 were raised to First and 6 lowered to Third in the 
second marking. Conversely, out of 14 who were Failed in the second 
marking, only 5 had been considered Failures in the first marking of 
these answer books, and 9 had been Third Divisioners. 

The little dashes (—) indicate zero cases at the points which would
show perfect agreement on Division. They are to help the eye to find the
diagonal of perfect agreement in each chart.

At the bottom of the chart we also find the words, “Agree = 58%”.
This refers to the fact that 58% of the candidates received the same
Class or Division in both markings by Examiner C.
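
The “Agree” figure is simply the share of candidates lying on the diagonal of a scatter chart. The Python sketch below rebuilds Examiner C's chart from the counts quoted above. Table 13.3 itself is not reproduced here, so the cells not stated in the text are inferred, on the assumption that the batch totals 50 and that 58% means exactly 29 agreements. The variable names are ours.

```python
DIVISIONS = ["I", "II", "III", "Fail"]

# Rows: Division in the first marking; columns: Division in the second marking.
# Stated in the text: all 4 Firsts became Seconds; of 22 Seconds, 2 were raised
# to First and 6 lowered to Third; the 14 Fails of the second marking came from
# 9 Thirds and 5 Fails. The remaining cells are inferred, not quoted.
chart = [
    [0, 4, 0, 0],    # I
    [2, 14, 6, 0],   # II
    [0, 0, 10, 9],   # III
    [0, 0, 0, 5],    # Fail
]

total = sum(sum(row) for row in chart)
agree = sum(chart[i][i] for i in range(len(DIVISIONS)))
percent_agreement = 100 * agree / total

row_totals = [sum(row) for row in chart]        # first marking
col_totals = [sum(col) for col in zip(*chart)]  # second marking
print(f"Agree = {percent_agreement:.0f}%")
```

The row totals give the Divisions awarded in the first marking and the column totals those of the second, which is how the Total (T) row and column of each chart are read.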

The remaining charts may be read in the same way. Note that the
top row of four charts in Table 13.3 shows the self-agreements of the four
examiners; the bottom row shows their agreements with each other.

It is interesting to note that the self-correlation of Examiner D (.75)
is lower than his correlation with Examiner C (.80), though the difference
is not statistically significant. There is an even wider difference between
the percentage agreement on Division (44% vs. 72%). These differences
may be partly due to characteristics of the batches of answer books, as
when both of these examiners are working on identical batches the
self-agreement is higher. (This is shown by each column of scatter charts.)
Note, however, that percentage agreement is higher with the other than
with the self on both occasions when dealing with identical batches
(72% vs. 58% and 52% vs. 44%). Why this should be so we do not
know, but it does serve to demonstrate the fact that r and per cent
agreement on Division are two different statistics.

The last two lines of Table 13.3 give the differences in self- and
other-correlation. All r’s were translated into Fisher’s z and the
differences calculated. Examiners A and B correlate significantly* less well
with each other than Examiner B correlates with himself in one
comparison, but not in the other or in comparison with A’s self-correlation.
The same is true of Examiners D and C.
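
For readers unfamiliar with Fisher's transformation, the following Python sketch shows one common way of testing the difference between two correlations. It is an illustration under stated assumptions, not the computation used in the study: it treats the two r's as coming from independent samples of 50 answer books each, which the text does not confirm, and the function names are ours. The values .75 and .80 are Examiner D's self-correlation and his correlation with Examiner C, as quoted above.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Two-tailed normal test for the difference between two independent r's."""
    diff = fisher_z(r1) - fisher_z(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the z difference
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


z, p = compare_correlations(0.80, 50, 0.75, 50)
print(f"z = {z:.2f}, p = {p:.2f}")
```

On these assumptions the difference falls well short of the 5% level, in line with the statement that it is not statistically significant.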

The left half of Table 13.4 shows the self-correlation of Examiners
E, F, G, H, I, J. Note that even Examiner J, who had the highest
self-agreement (+.909), spread both his II and III candidates over three
Divisions when he re-marked them. Percentage agreement and r are,
of course, different measures of agreement. One reason why they are
not closely related is that a perfect correlation of marks can exist even
if the Divisions are entirely different. Another reason is that the Division
categories are of unequal size: Fail covers 17 marks (including 0),
III only 6 marks.
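
A small constructed case shows how a perfect correlation can coexist with complete disagreement on Division. The marks below are hypothetical. The Fail (0 to 16) and III (17 to 22) ranges follow the text, while the boundary between II and I at 30 marks is an assumption made for this illustration; the function names are ours.

```python
import math


def division(mark):
    """Division for a mark out of 50 (II/I boundary at 30 is assumed)."""
    if mark >= 30:
        return "I"
    if mark >= 23:
        return "II"
    if mark >= 17:
        return "III"
    return "Fail"


def pearson_r(xs, ys):
    """Product-moment coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


first = [17, 19, 23, 26, 30, 33]     # hypothetical first marking
second = [m - 7 for m in first]      # same order of merit, 7 marks lower

r = pearson_r(first, second)
agreements = sum(division(a) == division(b) for a, b in zip(first, second))
print(f"r = {r:.2f}, agreements on Division = {agreements} of {len(first)}")
```

Here the second examiner ranks the candidates exactly as the first did, yet not one candidate keeps his Division.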

The right half of Table 13.4 shows what happened to the six equivalent
batches into which Examiner X’s 300 answer books were divided.
If you look at the Totals (T) in the extreme right column of each chart,
you will see that the various batches did differ slightly. (If there were
eight First Divisioners, two examiners had to receive two each while
the other four had only one each. Also, some adjustments were made to
make the means and SDs equal.) However, look at the Total row at the
bottom of each scatter chart, comparing all six of them with each other.
(They are lined up to make this comparison easy.) Examiner G awarded
31 Third Divisions while Examiner E awarded only 11. Examiner J
failed 5 while Examiner E failed 15 in their virtually identical batches of
answer books.

Comparing the two halves of Table 13.4, it is interesting to note
that Examiner G has a higher correlation with Examiner X than with
himself. The difference is not statistically significant, but neither are the
differences for Examiners E, F, H, I, and J. The point is that all six of
these examiners (as well as six out of the eight comparisons in Table 13.3)
agree equally well whether marking their own or someone else’s answer
books. The reader may want to speculate for himself as to the reasons
why biology examiners do not agree with themselves any better than
they agree with each other. In history we talked about different “styles of
marking”. For biology it may well be that examiners have the same
general expectations, and that the differences are therefore purely due to
random error rather than to any consistent differences in the type of
answer expected.

* See footnote on level of significance in Chapter XI, Section 11.

Examiner C shows a lower per cent agreement but a higher r with
himself than with D. The same is true for D with himself and with C.
All of this indicates that per cent agreement and r are by no means
equivalent measures of agreement.

The interested reader may make other comparisons for himself.

12. Self-consistency of examiners’ distributions

We have already discussed the product-moment correlation as a
measure of self-consistency. Another measure is the extent to which
examiners tend to agree with themselves in M, SD, and percentages of
marks in each Division.

We get a more critical picture of this when we examine a matched
pair of examiners. Take, for example, Examiners A and B. Remember the
experimental design. The hundred answer books originally marked by
Examiner A were split into two equivalent batches of 50 each. Similarly,
B’s 100 answer books were split into two equivalent batches. Fifty of
Examiner A’s and 50 of Examiner B’s answer books were then mixed
up in a random order, and sent to Examiner A for re-marking. Similarly,
B received a mixed batch of both his own and A’s answer books. The
examiners were told only that “some of the books” would be answer
books originally marked by themselves, some originally marked by others.
In response to a questionnaire, examiners assured us that they were
generally not aware which books were which.

Here is the comparison of A and B (from Table 13.2):

            Mean                 SD
          A       B           A       B
        22.9    21.7        5.55    4.93
        19.4    22.0        5.57    4.87

In the first row, the 50 answer books to which A originally awarded
mean marks of 22.9 were awarded mean marks of 21.7 by Examiner B.
The second row must be read backwards. The batch to which B originally
awarded a mean of 22.0 was awarded mean marks of 19.4 by A. The
SDs are read in the same way. Note that for both batches, B used a
narrower distribution of marks than A. No such consistency appears in
the means.

Here is the same comparison for C and D, who were also a matched
pair in the experimental design:

            Mean                 SD
          C       D           C       D
        22.3    22.1        5.01    5.14
        22.3    25.8        4.27    4.62

Again, consistency in SD but not in mean. One hypothesis that came
up was that there may be a true consistency in means also, but that
it is masked by the tendency to give slightly lower mean marks the
second time. When we made this correction, we found that D’s means
were slightly higher than C’s on both occasions, but no such consistency
appeared in A and B. Nor (as we shall see next) is there any such
consistency for the remaining examiners.

For more detail, let us compare the six examiners who were
matched against nearly identical batches of Examiner X’s marking:

              Means when re-marking        SDs when re-marking
Examiner        Self          X              Self          X

   E          21.9 (2)     20.0 (5)        6.86 (1)     7.41 (1)
   F          20.6 (4)     19.3 (6)        3.80 (5)     4.26 (5)
   G          23.8 (1)     20.7 (2.5)      3.65 (6)     3.83 (6)
   H          19.6 (5)     22.4 (1)        4.32 (4)     5.31 (2)
   I          21.8 (3)     20.1 (4)        5.08 (3)     [?]  (3)
   J          19.1 (6)     20.7 (2.5)      5.82 (2)     5.1[?] (4)


The rank orders are given in parentheses beside each number.
Notice that here, as in the comparison of pairs above, the same pattern
emerges: no consistency in means, but striking self-consistency in SDs.
The rank order correlation for means is -.27 (which is essentially zero);
for SDs it is +.77.
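
Both rank order correlations can be reproduced from the ranks printed beside the figures. The sketch below uses the usual formula rho = 1 - 6·Σd²/(n(n² - 1)), applied directly to the printed ranks (ties included). It is our reconstruction, not necessarily the authors' working. Two ranks that are unclear in the printed table are taken as G = 6 for self-SD (the only rank missing from that column) and H = 1 for the mean when re-marking X.

```python
def rank_correlation(ranks_a, ranks_b):
    """Spearman's rho from two lists of ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Ranks for Examiners E, F, G, H, I, J, in that order.
mean_self = [2, 4, 1, 5, 3, 6]
mean_x = [5, 6, 2.5, 1, 4, 2.5]
sd_self = [1, 5, 6, 4, 3, 2]
sd_x = [1, 5, 6, 2, 3, 4]

rho_means = rank_correlation(mean_self, mean_x)
rho_sds = rank_correlation(sd_self, sd_x)
print(f"means: {rho_means:+.2f}   SDs: {rho_sds:+.2f}")
```

The results, -.27 for means and +.77 for SDs, agree with the figures given in the text.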

How to account for this? Except for where Examiner D is involved,
the differences in means are mostly quite small (see Table 13.2). This
suggests that perhaps examiners agree pretty well on what the average
standard should be, and that they vary around this average more or less
at random. However, they do not agree on the distribution of marks:
how many high and low marks should be awarded. It is here that there
are consistent differences in “standard” among examiners. Each examiner
applies his own concept of how many should fail, how many should be
placed in each Division, to each batch of answer books that he marks.
Apparently the present instructions and “model papers”, etc., issued do
not serve to control these differences in “standard” adequately. It would
be wise for the authorities to specify more exactly the distribution
which they expect, for “average” or “typical” groups at least. (This
is not a radical suggestion. It is, in fact, common practice in many
countries.)

13. Item analysis 

One of the most revealing procedures that any examiner can carry 
out is an item analysis. At the simplest level, it takes very little work: 
Merely select a random group of answer books (say 100), arrange in 
order of merit, divide them into several equal groups, and then tabulate 
how well each of these groups did on each question. 
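
The procedure just described can be sketched in Python as follows. This is a minimal illustration with hypothetical data, not the tabulation program used for Table 13.5; the function name and the data layout (one dictionary of question number to marks per answer book) are our assumptions.

```python
from collections import defaultdict


def item_analysis(answer_books, n_groups=5):
    """For each ability group, tabulate (number choosing, mean mark) per question.

    `answer_books` is a list of dicts mapping question number -> marks awarded,
    holding only the questions the candidate attempted. Books are ranked by
    total marks; any books left over after equal division are ignored.
    """
    ranked = sorted(answer_books, key=lambda book: sum(book.values()), reverse=True)
    size = len(ranked) // n_groups
    table = []
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size]
        marks = defaultdict(list)
        for book in group:
            for question, mark in book.items():
                marks[question].append(mark)
        table.append({q: (len(v), sum(v) / len(v)) for q, v in sorted(marks.items())})
    return table


# Four hypothetical answer books, divided into two groups.
books = [{1: 8, 2: 7}, {1: 6, 3: 9}, {2: 3, 3: 4}, {1: 2, 2: 1}]
table = item_analysis(books, n_groups=2)
for group_number, row in enumerate(table, start=1):
    print(group_number, row)
```

Each row of the result corresponds to one ability group, from the highest to the lowest, and gives for every question the number of candidates choosing it and their average marks.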

We divided our thousand biology answer books into five groups of 
200 answer books each, on the basis of the first marking. Table 13.5 
presents the results. 

We had agreed to keep secret the identity of the Board supplying 
these data. Unfortunately, this precludes our publishing the biology 
question paper. However, its organization can be revealed, as it is fairly 
typical. 

There were eight questions, of which the candidate was to choose 
any five. The questions carried ten marks each, for a maximum of 50. 
Question 8 had five parts, of which the candidate (if he chose No. 8) 
was to answer any four. Each of the other questions (except No. 5) had
two parts, both compulsory for any candidate who chose the question. 
The marks for the parts were distributed in various ways: 2+8, 3+7, 
5+5, 6+4, 7+3. 

Table 13.5 gives the number of candidates at each ability level 
choosing each question, and their average marks on that question. For 
example, 138 in the top group chose Question 1, and their average marks 
were 5.7. The average total marks of the top group are 28.72, as shown 
at the extreme right. 

Note how different the average marks are for various questions, 
for students of equal ability (as judged by total marks). For example, 
a student in the high group who chose Question 8 would receive (on the 
average) 5.2 marks. But if he chose Question 7, he would receive 6.1 
marks, a bonus of nearly one mark just for his choice! The differences
in other ability groups are even wider.

Gayen (1961, pp. 51–53) has emphasized the fact that a candidate’s
total marks are determined partly by his choice of questions. This can
be illustrated for the bottom-scoring group (bottom row of Table 13.5), 
comparing the average marks of those who choose the five easiest as 
against those who choose the five hardest questions.

[...] of each question separately. The results might have been more revealing.
However, difference in difficulty in parts is not so serious as differences
between questions. This is because differences between parts affect all
candidates who choose the question approximately equally. It is the
differences between questions which lower the reliability and validity of
marks.

A more sophisticated method of item analysis might have revealed
these and other facts in more precise detail. This was not done as this
was not a central concern of this study. However, as a side-light, the results
seem sufficiently revealing of one more source of weakness in the
traditional type of examination, to justify the effort put into them.

WHEN DISCUSSING objective examinations, how often
we hear teachers say, “But mathematics examinations are
objective!”

If by this they mean that there is not much room for disagreement
in marking, then of course, they are partly right.

However, if by this they mean that mathematics examinations are
comparable to objective type examinations, then they are absolutely
wrong.

Objective examinations contain a much wider and more accurate
sampling of the candidate’s knowledge and ability than do traditional
mathematics examinations. For this reason the results are far more
reliable. Mathematics examinations of the traditional type are unreliable
because the sampling of questions is totally inadequate. This defect
is further enlarged by the fact that choice of questions is allowed, and
the marking is on a scale very different from that used in any other
subject.

All of this, and much more, will become quite clear as we examine
the evidence uncovered in this study.

To represent mathematics we used the geometry paper. This was
the recommendation of the Secretary of the Board, who felt that the
marking of the geometry paper was less purely mechanical than the
marking of the algebra paper. (Whether or not this is actually true would
be an interesting subject for some future investigator.)

The geometry examination consisted of three groups of questions,
with choices allowed within each group. The questions in two of the
three groups had two parts each. Maximum marks were 50 for this paper.
(For details see “Item Analysis”, at the end of this section.)


1. Distribution of marks 

Table 14.1 gives the frequency distribution for each batch of 50
answer books, in geometry. These are the detailed frequencies on which
the totals given in Table 4.5 are based.

Table 14.1-A shows the frequency distributions for the first marking
of each batch of answer books. The top row names the original examiner 
of each batch. It is his marks which are distributed in this part of the 
Table. 

The hundred answer books of Examiner A were split into two 
equivalent batches of 50 answer books each. One of these batches was 
to be re-examined by Examiner A himself, the other by Examiner B, 
as shown in the second row of the Table. The answer books of Examiners 
B, C, and D were treated in the same way. The reader can see, from this
Table, that the batches were as closely matched as was reasonably possible. 
Some adjustments had to be made to make the means and SDs equal. 

The 300 answer books of Examiner X were split into 6 equivalent 
batches. The matching here, also, is as close as is feasible. 

Table 14.1-B shows the marks awarded each batch of 50 answer 
books, the second time they were marked. Compare, for example, the 
fourth pair of columns. The two batches of Examiner D’s work, which 
were closely matched in the first part of the Table, are now awarded quite
different distributions of marks by Examiners C and D. 

The lines across the Tables show the boundaries between the Classes
or Divisions. Notice the striking J-effect in the first marking (14.1-A).
A large number of candidates (46) were awarded 17 marks (passing),
but hardly any (8) were awarded a failing 16. In re-marking (14.1-B)
this J-effect is greatly reduced, but by no means disappears. Notice that
only four of the ten examiners (A, C, F, J) are completely consistent in 
using the J-effect both the first and second time with all four batches of 
50 answer books that they examined. (Examiners are referred to the 
footnote near the beginning of the section on Hindi, for a good rationaliza- 
tion against the sort of educational brinkmanship represented by this 
tendency to raise borderline candidates by one mark.) 
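
One simple way to look for a J-effect in any frequency distribution is to compare the number of candidates at the lowest passing mark with the number one mark below it. The Python sketch below is only an illustration. The counts at 16 and 17 marks are those quoted above, the neighbouring counts are hypothetical, and the function name is ours.

```python
def j_effect(frequencies, boundary):
    """Return (count at the boundary mark, count one mark below it).

    `frequencies` maps mark -> number of candidates awarded that mark;
    `boundary` is the lowest mark of the higher Division.
    """
    at = frequencies.get(boundary, 0)
    below = frequencies.get(boundary - 1, 0)
    return at, below


# 16 and 17 are from the text; 15 and 18 are hypothetical neighbours.
first_marking = {15: 20, 16: 8, 17: 46, 18: 22}

at, below = j_effect(first_marking, boundary=17)
print(f"{at} candidates at 17 marks against {below} at 16")
```

With no pulling-up of borderline cases, the two counts would be expected to be of roughly the same size.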

There is also a tendency to a J between First and Second Divisions,
but within the context of the variation in frequencies of various marks,
this is not very pronounced in the first marking. In the second marking
it disappears altogether. (The same J appeared in history, Hindi, and
biology. In history and Hindi it disappeared in the second marking, but
in biology it was reversed.) There is also an interesting reverse-J between
Distinction and First Division. Apparently mathematics examiners are
reluctant to give a Distinction and tend to pile up extra cases at the top
mark of the First Division. This reverse-J also appears in history and
Hindi (not in biology), but is one Division lower, i.e. between First
and Second. This suggests (as does other evidence) that “Distinction”
in mathematics is equivalent to “First” in other subjects. It is quite
clearly so psychometrically; the present evidence suggests that it may be
so psychologically, i.e. in the minds of the examiners, also. Why
mathematicians are permitted to use a different scale of educational values
is somewhat of a mystery.

Another striking difference between the mathematics “scale” and
those of other subjects is the use of the entire range of marks, rather
than just a portion of it. This would generally be considered a desirable
feature, if it were not for the fact that the distribution is so extremely
flat, indicating (as discussed earlier) that it is a highly artificial scale,
not a natural one, as it is unrelated to all we know about human abilities
and achievements.

The heavy use of zero must also be questioned. About three per cent
of the candidates are awarded “0”. It cannot be too strongly emphasized
that, in the context of an examination, zero is not a measurement. It
is, rather, purely an indication of the inadequacy and invalidity of the
examination. (In circumstances where the standard of the test is objectively
correct, zeros can indicate defective teaching, but that is obviously
not the case here.) Suppose we were using a thermometer with no marks
below 0°. Would the statement “temperature = 0°” then have any meaning?
No, of course not, because such a thermometer could not distinguish
even between temperatures of -1° and -100°. A thermometer should
show degrees below and above the coldest and hottest situations in which
it is expected to be used. A valid examination should have a range wide
enough so there are no zero and no 100% marks for the group for which
it is constructed.

Incidentally, there is also interesting internal evidence that “0”
is not a measurement. If it is a valid measurement, how did it happen
that there were 27 zeros the first time the answer books were marked,
and 36 the second time, an increase of 33%?

2. Table 14.2

Table 14.2 summarizes most of the statistical tables produced by
the IBM 1620 computer. The summary means and medians of these
statistics have already been reported in Table 10.1.

The top row tells who originally examined the batch of 50 geometry
answer books which provided the statistics in the column. The second row 
tells who examined the answer books the second time. The third row 
merely notes (for easier reference) whether the answer books were re- 
examined by the Same examiner, or a Different one from the original. 

In the remainder of the Table, “O” refers to the Original marking, 
and “R” to the Re-examining. 

3. Mean and standard deviation 

The wide variety of means and SDs should perhaps not be taken 
too seriously. Though the samples of the work of each examiner are 
not believed to be biased in any significant way, they are also not 
strictly “random”. The exigencies of the situation required that we 
take the first hundred answer books that we could find to represent each 
examiner (see Chapter VI). There is no reason to believe that they were 
essentially different from the rest of the work of that examiner in any 
manner significant to this study. However, we cannot assume that they 
exactly resemble the mean and standard deviation of his whole batch.
This fact does not affect the differences between marking and re-marking, 
which are the prime consideration of this study. 

The signs attached to the row of Differences between the means
are arbitrary. When the second mean is higher, we have attached a
“+”; when the mean for re-examining is lower, we have attached a “-”.
This seems logical (or at least psychological!).

The same comments apply to the SDs (standard deviations). A
“+” means the second examiner widened out the distribution, a “-”
means that he attenuated it.

Note, however, that in a sense the “direction” is really irrelevant.
“First” and “Second” examiners could easily be interchanged. There
are some consistent features of the “second” examiners: on the average
they have marked a little lower, and with less J-effect. However, the far
more important factor is the actual differences, regardless of “direction”.

Differences between means range up to 2.78 (which is 5.5% of marks).
Surprisingly, this largest difference occurs when the same examiner
re-marks his own answer books. The maximum difference in SD, 4.06
(or 8% of marks), also occurs with the same examiner re-marking his
own answer books. The largest mathematics difference in means was
smaller than the largest difference for any of the other three subjects,
but the largest difference in SDs was second only to the largest biology
difference. (However, Table 10.1 showed that on the average geometry
and biology examiners did about as well as each other, and only slightly
better than history and Hindi, in respect to these differences.)

For the differences between means, * indicates that the difference
is significant at the 5% level of confidence, ** that it is significant at the
1% level of confidence, and *** that the difference is significant at the
0.1% level of confidence.

The interpretation of the asterisks attached to the SDs is the same,
except that, unfortunately, we had no tables for significance at the 0.1%
level.

We may summarize the data as follows, giving the number of
differences which are significant at each level of confidence:

    M      [figures illegible]
    SD     [figures illegible]

                                                     Re-Examiner
                                                     Same    Diff

    Neither M nor SD difference is significant       [figures illegible]
    Mean difference only is significant              [figures illegible]
    SD difference only is significant                4 [second figure illegible]
    Both M and SD differences are significant        [figures illegible]

Remember that if the examiners [...] the term), then the examiners
[...]. We find only five cases (4 of self re-marking, [...]) where both
the differences were small enough to be attributed only to random chance.
Note, however, that this alone does not [...] that both sets of marks
will be the same. The coefficient of [...] must also be taken into
account. When Examiner [...] re-marked his own answer books, for
example, he agreed with [...] fairly closely in mean and [...], but
his self-correlation was only [...], producing an average [...] of
2.72 between individual marks. [...] marks [...] the same answer
book, in fact, differed by as much as [...] out of 50.

Fifteen out of the 20 [...] significantly different in mean,
SD or both, indicating [...] other than sheer chance [...]. On the
basis of mean [and] SD (though not [...]) the biologists did better than
[the mathe]maticians, indicating that they had a clearer common standard.
Which, again, [...] the “objectivity” of mathematics examinations [...]
be somewhat [...].

4. Distribution of Divisions

Those who do not understand [means] and standard deviations can
get a clear picture of what [...] if they look at the [...]
rows. These show how many [...] were awarded each [Division]
by each examiner. This is a good way of knowing the subjective standard
that the examiner has in mind when he marks the answer books.

In each major column, you will find two columns of figures in this 
section. These show the number (out of 50) awarded each Division the 
first and second time, i.e. by the Original and Re-examiner. 

Let us first examine the cases where an examiner re-marks his own
answer books. Look at the seventh column, with D D at the top. Examiner 
D awarded 4 Second Divisions the first time, but raised it to 10 the second 
time. On the other hand, he reduced his Failures from 35 to 29. Thus the 
second time his mean is higher, but his SD remains about the same. 
Only Examiner E is completely consistent in “standard” for both mark- 
ings. The differences of the others, however, tend to be less than for the 
other three subjects in this study.

The differences when a different examiner re-examined the answer 
books averaged about 4 cases (or 8%) in each Division. For example, 
look at the fourth column, where A re-examined B’s answer books.
They agreed on the number of Distinctions (4) but on nothing else. A 
awarded more Seconds, but fewer Firsts and Fails — which is why A had 
a smaller SD than B. When B re-marks A, the same differences appear. 
B tends to have a lower mean but a wider spread (larger SD) than A in 
both instances. 

5. Pass Percentage 

A popular (though very crude and unreliable) measure of “quality” 
or “standard” is the Pass Percentage. A glance along the next three rows 
shows how little trust can really be placed in this figure. The first of these 
three rows shows the percentage passed in the Original marking, the 
second row the percentage passed in the re-marking, and the third row 
the difference between these two percentages. 

Notice that even the same examiner, re-examining his own answer 
books, can differ from himself by as much as 12% in Pass Percentage 
(Examiner D). Of course, this is smaller than the difference found in 
biology, and much smaller than the differences in Pass Percentage in his- 
tory and Hindi. Still, it hardly inspires us to place the confidence in 
Pass Percentage that some people do. Examiner A’s Pass Percentage 
remained the same both times he examined his answer books — though 
he differed from himself in every other Division. Examiner E’s Pass
Percentage also remained unchanged, but there must have been differences 
among the various levels of Failure to produce the highly significant 
difference in mean shown. 
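
The 12% figure for Examiner D follows directly from the numbers of Failures quoted in the preceding section (35 in the original marking and 29 in the re-marking, out of 50 answer books). A minimal Python check, with names of our own choosing:

```python
BATCH_SIZE = 50  # answer books in each batch


def pass_percentage(failures, batch_size=BATCH_SIZE):
    """Percentage of a batch passed, given the number failed."""
    return 100 * (batch_size - failures) / batch_size


original = pass_percentage(35)   # Examiner D, first marking
remarked = pass_percentage(29)   # Examiner D, re-marking his own books
difference = remarked - original
print(f"{original:.0f}% -> {remarked:.0f}% (difference {difference:.0f}%)")
```

Six candidates out of 50 crossing the pass line is enough to move the Pass Percentage from 30% to 42%.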

Examiners X and J, surprisingly enough, also agreed perfectly with
each other in Pass Percentage, but they differed in number of First
and Second Divisioners and had an average difference of 2.5 in pairs
of marks. But their close agreement in means and SD seems to indicate
a fairly good agreement in general standard, which could have
been made much closer by adequate scaling. The largest difference in
Pass Percentages for different examiners was 10% (while the largest
for the same examiner re-examining his own papers was 12%).

Is it fair to condemn a teacher because the Pass Percentage in his
class is 10% lower than in another teacher’s class? Or should we not
rather just attribute this to the general unreliability of traditional
examinations?


6. Examiner reliability

There is a wide variation among the coefficients of correlation
between each pair of marks.*

The self-correlations of examiners re-examining their own books
ranged from .809 (Examiner D) to .988 (Examiner A). The median of
.98 shows a high degree of self-agreement. This is in fact somewhat
higher than the usual reports of studies done in other countries.

When mathematics examiners marked each other’s answer books,
the product-moment coefficient of correlation ranged from .916 (Examiner
D re-marking C’s answer books) to .965 (Examiner B re-marking A),
with a median of .95. It is interesting to note that the lowest coefficient
was for self-marking, not for one re-marking another’s.

But even a correlation of .965 (A-B), which is very high indeed,
begins to lose its glamour when we note the difference between means
(2.3), SDs (2.18), and average difference between marks (3.70), with
some differences ranging as high as 15 marks out of 50. This clearly
illustrates the fact that the correlation coefficient alone is misleading, as it
takes no account of differences in mean and SD.


7. Content reliability

The calculations for content reliability (see Chapter VII) were
done on the basis of the original marks awarded each batch of answer
books. (Note that the batches for A, B, C, D are 100 each; for E, F,
G, H, I, J are 50 each; and for Examiner X we have six batches of 50
each.)


* The reader may at first be inclined to dispute this. The largest and
smallest correlations are .988 and .809 respectively. However, remember
that the non-linearity of r is most pronounced in the upper extremes. If
we translate all r’s for all subjects (see Table in Appendix) into their
equivalent z’s, we find the variation in mathematics is the largest (with
a z difference of 1.35). History comes next, with both Hindi and biology
showing substantially less variation.

The fact that there is a wide variety of content reliabilities for batches
of answer books from the same examination needs some explanation. 
In the first place, some of these differences may be due to different choices 
of questions. However, probably more important is each examiner’s 
style or method of marking. Thus, two examiners may expect two different 
kinds of answers (not completely different, of course, but different in 
significant aspects) from the same question. In a sense, therefore, we 
do not really have a single examination — we have as many examination 
papers as we have examiners. So students sitting for the same examination
are really sitting for different examinations; and, because the
student knows nothing about his examiner, he does not know which
examination he is sitting for. This is in line with Taylor’s remark (which 
repeats in India similar studies done in the West), that the candidate’s mark 
depends more on the luck of who his examiner is than on his actual merit. 

This interpretation is supported by comparing Examiner X’s six 
samples with each other and with the work of other examiners. Remember 
that the 300 answer books of Examiner X were divided into six more or 
less equivalent batches of 50 each. The content reliabilities for these 
batches range from .711 to .778. The difference between these figures is 
obviously small enough to be attributed only to sampling error. How- 
ever, the content reliability for Examiner X does differ significantly from, 
say, that of Examiner A (.904). Separate calculations for the separate 
batches of Examiner X were done only in the geometry paper. But if 
any reader wishes to do a similar study for the separate batches of the 
Examiner Xs in history, Hindi, and biology, the authors will be happy 
to supply him with all the necessary raw data. 

Note that content reliabilities range from .711 (for one sample 
of Examiner X) to .904 (A). Of the five examinations (i.e. the “same” 
examination marked according to five different systems) for which 
content reliabilities were calculated, only A’s would be considered really
adequate in an objective examination. These reliabilities are substantially 
higher than the .63 estimated by Gayen (1961, page 71). 


8. Marks reliability 

As has been explained earlier (Chapter VII), the only fair comparison
between traditional and objective examinations is on the basis of what
we have termed “marks reliability”. This is the product of examiner
and content reliability. We have calculated this only in the cases where a
different examiner has re-marked the answer books. “Marks reliability”
attempts to estimate the correlation between the marks of a batch of
candidates who have answered examination I marked by Examiner P,
and examination II marked by Examiner Q.
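As a small worked illustration of this product, the sketch below multiplies two figures quoted in this chapter. It assumes that A's content reliability (.904) is the one paired with the A-B examiner reliability (.965); the text does not state the pairing explicitly.

```python
def marks_reliability(examiner_r, content_r):
    """Marks reliability as defined in the text: examiner x content reliability."""
    return examiner_r * content_r

# Figures quoted in this chapter; pairing them is an assumption.
examiner_r_ab = 0.965   # Examiner B re-marking A's answer books
content_r_a = 0.904     # content reliability of A's batch

print(round(marks_reliability(examiner_r_ab, content_r_a), 2))
```

Under that assumption the product comes to about .87, in line with the highest marks reliability the chapter reports for A-B.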
Marks reliabilities range from a low of .68 (D-C and X-J) to a high
of .87 (A-B), with a median of .77. The highest of these might be considered
acceptable in an objective examination, but only two out of the ten
coefficients even begin to approach adequate reliability to differentiate
among individuals. The best that can be said for them is just that
they are better than those of any of the three other subjects in this
experiment.


Of course, the reliability of the mathematics mark based on two
papers is somewhat higher than the reliability of a single paper. If the
correlation between the two papers is quite high, the reliability of the
total mark may approach the minimum acceptable. However, we do
not know the correlation, unfortunately; but it is this author's hunch
that it is not likely to be very high. Even at best, only some mathematics
marks would be adequately reliable for differentiating among
individuals.
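How much the second paper helps can be sketched with a standard psychometric formula for the reliability of an unweighted sum of two papers. This formula is not given in the report; it assumes equal SDs on the two papers and uncorrelated errors, and every input below is hypothetical.

```python
def composite_reliability(r1, r2, r12):
    """Reliability of the sum of two equally weighted papers with equal SDs.

    r1, r2 : reliabilities of the individual papers
    r12    : correlation between the two papers
    Assumes the errors on the two papers are uncorrelated.
    """
    return (r1 + r2 + 2 * r12) / (2 + 2 * r12)

# Hypothetical inputs: each paper has a marks reliability of .77
# (the median reported in the text); the inter-paper correlation is unknown.
for r12 in (0.3, 0.5, 0.7):
    print(r12, round(composite_reliability(0.77, 0.77, r12), 3))
```

The total mark gains reliability as the correlation between the papers rises, which is the dependence the paragraph above describes.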


In a sentence: Many mathematics marks, perhaps a majority, are
unreliable.


9. Standard error of measurement (SEM)

This unreliability is most clearly brought out by the wide SEM,
which reflects not only the total reliability, but also the abnormally wide
range (SD) of mathematics marks. The SEM ranges from 6.17 to 3.47
marks out of 50. This means that even for the most accurate batch, there
are only two chances in three that a given candidate’s mark will be
within 3½ marks of his “true” mark. For about 5 candidates in every
100, the error will be more than 14%, even for this best batch. But,
as has been pointed out, our concern should be for the largest (not the
smallest) errors that can occur. An SEM of 6.2 marks means that for
one candidate in each hundred (or 1,000 in every lakh), the error will
be 32% of marks or more. And whether the mark awarded will
be 32% above or 32% below the “true” mark is a matter of sheer
chance.
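These percentages can be reproduced from the two SEMs quoted above, assuming normally distributed errors. The sketch also includes the usual definition SEM = SD x sqrt(1 - reliability); the function names are illustrative, not from the report.

```python
from math import sqrt
from statistics import NormalDist

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)

def error_exceeded_by(sem_marks, fraction, max_marks=50):
    """Error size (as % of maximum marks) exceeded, in either direction,
    by the given fraction of candidates, assuming normal errors."""
    z = NormalDist().inv_cdf(1.0 - fraction / 2.0)
    return 100.0 * z * sem_marks / max_marks

print(round(error_exceeded_by(3.47, 0.05), 1))  # best batch, 5 in 100
print(round(error_exceeded_by(6.17, 0.01), 1))  # worst batch, 1 in 100
```

The two results, about 13.6% and 31.8%, round to the 14% and 32% given in the text.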

But note that these estimates of errors take into account only the
correlation, not the differences in M and SD. Thus they would be true
only if the marks of all examiners were statistically scaled to a common
standard. More exactly, each SEM shows the range of error which
would exist if the re-marking by the second examiner were scaled to the
mean and standard deviation of the original marking of the original
examiner. The actual range of error, under current unscaled examination
marking methods, would probably be much greater (see discussion in
Chapter VII).
10. Differences

The last section of Table 14.2 presents the frequency with which
each difference between examiners on a single answer book occurred.
Note that this does not take into account the content reliability (as
SEM does), but only the actual difference between two markings of the
same answer book. If each candidate had answered two parallel examinations,
marked by different examiners, these differences would be much
larger.

The largest difference for the same examiner re-examining
his own answer books was 20 marks. (This was even larger than the
largest difference for a different examiner, which was 15.) This was a difference
in the two markings of one answer book by Examiner 1. It is
instructive to list the marks in the parts of each question, as well as
the total. The details are as follows:


Question    First Marking    Second Marking

    5       1½ + 0  = 1½       4 + 4 = 8
    6       0 + 3½  = 3½       4 + 4 = 8
    7       0 + 0   = 0        0 + 0 = 0
    9       0 + 4   = 4        1 + 4 = 5
   12           0   = 0            8 = 8

Total                9                29

The same examiner first failed this candidate, then raised him to the
top of the Second Division! In another case, this examiner had an
18-mark difference.

The reader may be tempted to think that this examiner may just
have been marking the answer books “at random”, and not taking the
experiment seriously. Of course, if there was undue carelessness (the
hyphen is not a typographical error), there is no way of knowing whether
this carelessness was in the original or in the second marking. However,
if the marking was purely at random, then there would be no consistency
whatsoever in the marks. As a matter of fact, however, this examiner
does tend to give the questions roughly the same rank order each time
he marks them, indicating clearly that he did re-read them, and did
not just “invent the marks”. Furthermore, no “at random” marking
would produce a self-correlation of .81. So we seem to have to conclude
that the examiner has somewhat variable standards. Since his errors
are not very much larger than those of others, we must conclude that
there are other “experienced examiners” whose marks are just as
untrustworthy.
The largest difference between a pair of examiners was 15 marks.
It was between Examiners A and B, as follows:

Question    First Marking    Second Marking

    1       4 + 1   = 5        1½ + 0 = 1½
    2       2½ + 1  = 3½       1 + 0  = 1
    4       3 + 2   = 5        0 + 1  = 1
    7       4 + 0   = 4        3½ + 0 = 3½
    9       0 + 1   = 1        0 + 0  = 0
   12           3   = 3            0  = 0

Total          21½ = 22               7

This largest difference of 20 marks occurred once in 1,000 answer
books. Thus the probability is that there would be “errors” of 20 or more
for at least 200 out of every lakh of candidates. Slightly smaller errors
would be very much more common (see Table 4.1).

Note that, by chance, one of the above examples was an example of
the second examining raising the marks, and the other an example of
lowering. Changes in both directions are nearly equally common.

It is also instructive to look at a case, chosen nearly at random, of
“agreement” between two examiners:

Question    First Marking    Second Marking

    1       3½ + 2  = 5½       4 + 3  = 7
    2       3 + 1   = 4        3 + 0  = 3
    5       3½ + 0  = 3½       2½ + 1 = 3½
    7       1 + 0   = 1        0 + 0  = 0
    9           0   = 0            0  = 0
   11           0   = 0            0  = 0

Total               14          13½ = 14

We will let the reader draw his own conclusions from these marks.

The average differences are reported at the bottom of each column
in Table 14.2. The smallest is 1.20 (Examiner H) and the largest is 4.06
(Examiner D re-examining his own answer books). It is amusing (or is
it tragic?) that both the largest difference between means (2.78) and the
largest mean difference (4.06) were for a mathematician re-marking his
own answer books, not someone else's! Others did slightly better, of
course.

The average differences thus range from 2.4% to 8% of marks.
This is not very different from what was found for history and Hindi.
But when speaking of averages we are painfully reminded of the 220-volt
refrigerator which burned out even though the average electric current
was just 220 volts. So it is the largest differences between examiners that
are possible that should concern us. One out of every 17 of the differences
in geometry was 8 marks or more out of 50 (16%). Even the Divisions
of 23% of the geometry candidates were different in the two markings.
There is little comfort in the “average”, when the examination can
be so very unjust to so many.
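The contrast between the average and the extremes is easy to tabulate for any pair of markings. The sketch below uses invented marks for ten answer books (not data from this study), and the 8-mark threshold simply echoes the figure mentioned above.

```python
def difference_summary(first, second, threshold=8):
    """Mean, largest, and share of large absolute differences between
    two markings of the same answer books."""
    diffs = [abs(a - b) for a, b in zip(first, second)]
    return {
        "mean": sum(diffs) / len(diffs),
        "largest": max(diffs),
        "share_at_or_above": sum(d >= threshold for d in diffs) / len(diffs),
    }

# Hypothetical marks out of 50 for ten answer books (illustrative only).
first = [30, 12, 25, 41, 18, 22, 35, 9, 27, 16]
second = [28, 14, 25, 33, 19, 20, 36, 29, 26, 15]

summary = difference_summary(first, second)
print(summary)
```

In this invented batch the mean difference is under 4 marks, yet one candidate's mark moves by 20.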

11. Reliability and scaling 

Perhaps the most striking general conclusion is that mathematics 
examinations would probably profit more than any of the others by the 
introduction of scaling. Any method of scaling would help. All that is 
needed is to apply some judgement (or else a statistical formula), rather 
than using just simple arithmetic in calculating “per cent of marks”. 
The mathematicians, of all people, should not feel themselves bound to 
such elementary (and, really, rather irrational) procedures. 

Geometry examinations show a higher examiner reliability than any 
other. They also show a higher content and total reliability. The differences 
between the means are smaller than for any other subject. But all of these 
advantages are thrown away by using a fantastically large SD, and a 
fantastically flat distribution curve— a distribution which we know is 
artificial (or an artifact of the marking method), because it does not fit 
with any scientific knowledge of the distribution of human abilities. 
This large SD also produces unnecessarily large differences between 
marks, and unnecessarily large SEM’s — making geometry, in spite of its 
high correlation coefficients, actually, in the practical sense, the least 
reliable of the four papers studied in this report (see Figure 4.6).


12. Scatter charts 

Table 14.3 compares the work of Examiners A and B on their own 
and on each other’s answer books. It also compares the re-marking 
of Examiners C and D with each other. 

These and Table 14.4 are the individual scatter charts which are 
summarized in Table 4.3. 

For example, look at the second chart in the top row of Table 14.3. 
This shows what happened when Examiner B re-marked his own 50 
answer books. The first time he awarded 8 First Divisions (line 1 and 
column T, which means total)— but in the second marking, 2 of these 
were reduced to Third, 1 reduced to Second, and 1 raised to Distinc- 
tion. Reading the chart up and down, we find that there were 10 Third 
Divisions awarded the second time. Out of these, 2 had originally been
Firsts, 1 a Second, and 3 had been failed by this same examiner the first
time he marked the answer books.

At the bottom of the chart we also find the words, “Agree 68%”.
This refers to the fact that 68% of the candidates received the same Class
or Division in both markings by Examiner B.

The little dashes ( — ) in some of the charts indicate zero cases at the 
points which would show perfect agreement on Division. They are to
help the eye to find the diagonal of perfect agreement in each chart. 
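The “Agree” percentage is the share of candidates lying on the diagonal of the chart. In the sketch below, only the First Division row, the Third Division column, and the 68% total follow the description given above; every other cell is invented to fill out a batch of 50.

```python
def percent_agreement(chart):
    """Share of candidates on the diagonal of a square Division chart.

    chart[i][j] = candidates placed in Division i by the first marking
    and in Division j by the second marking.
    """
    total = sum(sum(row) for row in chart)
    agree = sum(chart[i][i] for i in range(len(chart)))
    return 100.0 * agree / total

# Rows: first marking; columns: second marking.
# Order: Distinction, First, Second, Third, Fail. Partly hypothetical.
chart = [
    [6, 1, 0, 0, 0],
    [1, 4, 1, 2, 0],   # 8 Firsts: 1 raised, 4 kept, 1 to Second, 2 to Third
    [0, 1, 6, 1, 1],
    [0, 0, 1, 4, 2],
    [0, 0, 2, 3, 14],
]
print(percent_agreement(chart))
```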

The remaining charts may be read in the same way. Note that the
top row of four charts in Table 14.3 are the self-agreements of the four
examiners; the bottom row are their agreements with each other.

Notice the peculiar relationship between A's and B's self-consistency
and other-consistency. When A re-marks A he correlates higher with
himself (.988) than when he re-marks B (.952). However, B correlates
more highly with A in both instances (.965, .952) than with himself
(.912). This, and other comparisons within these three scatter charts,
suggest strongly that B was more careless in his original (i.e. the actual
Board exam) marking than in the re-examination of his answer books.

The left half of Table 14.4 shows the self-correlation of Examiners
E, F, G, H, I, J. Note that even Examiner I, with a .985 self-agreement,
spread most of his re-examined candidates over three divisions. Compare,
also, the other highest self-correlations: I, J, G, and (the highest
of all) A.

It is interesting to note Examiner E. This is the ONLY case out
of forty in the entire study where there was perfect self-agreement on
Divisions. And yet Examiner E did not have the highest self-r; the difference
between his means was highly significant, and he averaged 1.28
marks difference between the two markings, with one answer book
showing a difference of 5 marks out of 50 (10%) the two times. So his
perfect Per cent Agreement here seems to be due primarily to the fact
that he failed nearly all of his candidates both times.

Percentage agreement and r are, of course, different measures of
agreement. One reason why they are not closely related is that a perfect
correlation of marks can exist even if the Divisions are entirely different.
Another is (as we have seen) that the Division categories are of unequal
size. “Fail” covers 17 marks (including 0), “III” only 6. “Distinction”
is twice as wide as any other category, also. And not even Second and
Third are equal!

The other half of Table 14.4 shows what happened to the six
equivalent batches into which Examiner X’s 300 answer books were
divided. If you look at the Totals in the extreme right column of each
chart, you will see that the various batches did differ slightly. This was
forced by the process of dividing. Also, some adjustments were made to
make the means and SD’s equal. However, look at the Total row at the
bottom of each scatter chart, comparing all six of them with each other.
(They are lined up to make this comparison easy.) The differences are
not nearly so great as in other subjects. However, Examiner H, for
example, awarded 7 First Divisions, while Examiner J awarded 14 (twice
as many) to a nearly identical batch. Note that Examiner J is equally
generous with his own answer books (first half of the Table), while
Examiner H is equally stingy. (But see below on self-consistency.)
Distinctions range from 10 to 16 per batch, and Failures from 4 to 8.

Do examiners agree with themselves any better than they agree
with each other? Comparing the two halves of Table 14.4 gives a partial
answer to this question, as did the comparison of the two rows in Table
14.3. Specifically, the information is given at the extreme right of Table
14.4, and at the bottom of Table 14.3. All r’s were translated into Fisher’s
z, and the differences calculated. Two thirds of the 14 pairs of differences
are significant*, and in three of them (as we have seen) r is significantly
higher for other- than for self-correlation. Specifically, five differences
are significant and four highly significant. (And two more would be
significant if we used the one-tailed test.)
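A comparison of this kind can be sketched with the textbook test for the difference between two independent correlations. Whether the authors used exactly this standard error is an assumption. The example takes A's self-correlation (.988) and A's correlation when re-marking B (.952), with 50 answer books in each batch as described in the text.

```python
from math import atanh, sqrt
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed test of the difference between two independent
    correlations, using Fisher's z transformation."""
    z1, z2 = atanh(r1), atanh(r2)            # Fisher's z of each r
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_stat = (z1 - z2) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p

z_stat, p = compare_correlations(0.988, 50, 0.952, 50)
print(round(z_stat, 2), p < 0.01)
```

Because r is compressed near 1.0, a gap of only .036 between two very high correlations still corresponds to a large difference in z.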

The reader may wish to speculate as to why self-agreement is so
much higher than other-agreement in geometry. In biology and Hindi we
found no such difference. In history, where there was a difference, we
talked about “styles of marking”. Apparently there is no general
agreement among geometry examiners (as there is among Hindi and
biology examiners) as to what is required. Possibly this is because an
incorrect answer may be correct in some stages, but examiners may
not agree as to how many marks to award for each correct stage. Experts
in this subject may have other explanations. In any case, it seems
evident that more specific instructions to examiners are probably needed.
Extensive research in the West indicates that such instructions can
greatly increase the reliability of marking.

The above facts, combined with wide differences in mean, SD,
and number of cases in each Division, show that in geometry the standards
of various examiners also differ. (More of this later.)

It should also be noted (again) that a good deal of the wide scatter
in the various charts in these three tables is due primarily to the abnormal
scale used by mathematicians in marking their answer books. If their
marks were properly scaled, as is done in many countries, and has
been repeatedly recommended in India (e.g. Mahalanobis, 1934; Taylor,
1963), Percent Agreements on Divisions would rise to close to 100%.

* See footnote on level of significance in Chapter XI, section 11.
Those who still do not believe that the mathematics scale is different
are referred to Harper (1963), where it is shown that students of equal
ability (according to an outside criterion) receive different marks in biology
and mathematics. In the instance quoted, biology students with X amount
of ability received an average mark of 64, whereas mathematics students
with the same amount of ability were awarded an 82. At the other end, [...]
two scales really equivalent [...]

There are many other interesting [...] Tables 14.3 and 14.4 [...]
books, as well as when Examiner A re-marks B's. Note how far Examiner [...]

13. Self-consistency of examiners' distributions

We have already seen that, as a measure of self-consistency,
examiners tend to agree with themselves in M, SD, and percentages of
marks in each Division. We get a more critical picture when we examine
matched pairs of examiners. Take, for example, A and B. Remember
the experimental design. The hundred answer books originally marked
by Examiner A were split into two equivalent batches of 50 each.
Similarly, B's 100. Fifty of Examiner A's and 50 of Examiner B's answer
books were then mixed up in a random order and sent to Examiner A
for re-marking. Similarly, B received a mixed lot of both his own and
A's answer books. The examiners were told that some of the books
would be answer books originally marked by themselves, some originally
marked by others. In the questionnaire, examiners assured us that they
were generally unaware of which books were which.

Here is the comparison of A and B (from Table 14.2):

                       Mean               SD
Answer books of      A      B         A        B

      A            26.3   24.0      14.47    16.66
      B            21.1   19.7      10.44    [illegible]

This means that the answer books to which A originally awarded a mean
of 26.3 were awarded mean marks of 24.0 by B, while those to which B
originally awarded a mean of 19.7 were awarded mean marks of 21.1 by A. The
SDs are read in the same way. Note that for both batches, B used a wider
distribution of marks than A, and had a lower mean.

Here is the same comparison for C and D, who were also a matched 
pair in the experimental design: 


                       Mean               SD
Answer books of      C      D         C        D

      C            11.7   11.4      6.76     8.63
      D            11.1   11.1      9.50     8.45


Examiners C and D do not seem to show the consistency that A and 
B did. 

For more detail, let us compare the six examiners who were matched 
against nearly identical batches of Examiner X’s markings: 



              Means when re-marking       SDs when re-marking
Examiner        Self          X             Self          X

   E           7.7 (6)     29.2 (2)       7.79 (6)    12.05 (2)
   F          22.1 (2)     28.? (6)       9.80 (3)    10.83 (5)
   G          24.6 (1)     28.7 (4)      10.79 (2)    11.78 (4)
   H          10.6 (5)     28.6 (5)       9.53 (4)    12.66 (1)
   I          16.0 (4)     28.9 (3)       8.78 (5)    11.86 (3)
   J          19.1 (3)     31.1 (1)      12.08 (1)    10.69 (6)


The rank orders are given in parentheses beside each examiner. As
in C-D above, there is no consistency. The rank order correlations are
actually -.31 and -.77 for means and SDs respectively. If there are
any consistencies (and we seemed to discover some in Table 14.4), they 
are completely masked by the wide variety of means and SDs used. It 
would seem at least possible that Examiner X’s batch just happened to 
be one of exceptionally high ability (every examiner awarded higher means 
to X’s answer books than to his own), and also an unusually wide range 
of ability (all but one gave larger SDs). If we also assume an unusual 
amount of variability in Examiner E to J’s own batches, these two facts 
mask any consistencies that might otherwise appear. 
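The two rank order correlations can be recomputed from the ranks printed in parentheses in the table above. The sketch uses the usual Spearman formula for untied ranks, which is assumed to be the one the authors applied.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank-order correlation for untied ranks."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Ranks from the table, in the order E, F, G, H, I, J.
mean_self, mean_x = [6, 2, 1, 5, 4, 3], [2, 6, 4, 5, 3, 1]
sd_self, sd_x = [6, 3, 2, 4, 5, 1], [2, 5, 4, 1, 3, 6]

print(round(spearman_rho(mean_self, mean_x), 2))  # -0.31
print(round(spearman_rho(sd_self, sd_x), 2))      # -0.77
```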

This points up the fact that scaling each examiner’s work separately 
on an arbitrary statistical basis (as some advocate) may lead to consider- 
able injustice. There are three possible approaches: 

(a) The answer books may be randomized before sending to the 
examiners, as suggested by Taylor (1963). Then purely statistical scaling 
can be applied. 

(b) Or a more judgemental scaling method may be used, as 
recommended by Gulliksen (1950) and adapted to Indian conditions by 
Harper (1963). For large groups, the Board would provide detailed 
instructions on “standards” and “distributions” of average or typical
groups.

(c) If examiners do not wish to exercise their judgement, there
is a simple method whereby each examiner's “personal equation”
(so to speak) can be determined. This is described by Harper (1967,
footnote on page 36), and in the chapter on scaling in this book.

14. Item analysis

One of the most revealing procedures that any examiner can carry
out is an item analysis. At the simplest level, this takes very little work.
Merely select a random group of answer books (say 100), arrange them in
order of merit, divide them into several equal groups, and then tabulate how
well each of these groups did on each question.

We divided our thousand geometry answer books into five groups
of 200 answer books each, on the basis of the first marking. Table 14.5
presents the results.
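The simple procedure just described can be sketched as follows. The data layout (one dictionary of question marks per answer book) and the four sample answer books are hypothetical; the study's own tabulation was of course done by hand on 1,000 answer books.

```python
from collections import defaultdict

def item_analysis(books, n_groups=5):
    """For each ability group, count the candidates choosing each
    question and give their mean mark on it.

    books: list of dicts mapping question number -> mark awarded
           (only the questions the candidate attempted).
    Returns a list (best group first) of {question: (count, mean_mark)}.
    Any books left over after equal division are ignored.
    """
    ranked = sorted(books, key=lambda b: sum(b.values()), reverse=True)
    size = len(ranked) // n_groups
    table = []
    for g in range(n_groups):
        marks = defaultdict(list)
        for book in ranked[g * size:(g + 1) * size]:
            for question, mark in book.items():
                marks[question].append(mark)
        table.append({q: (len(m), sum(m) / len(m))
                      for q, m in sorted(marks.items())})
    return table

# Hypothetical answer books (question -> mark), illustrative only.
books = [
    {1: 6, 5: 8, 7: 7}, {2: 7, 5: 8, 8: 4},
    {1: 4, 3: 3, 7: 5}, {2: 5, 3: 4, 8: 2},
]
result = item_analysis(books, n_groups=2)
for group in result:
    print(group)
```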

We had agreed to keep secret the identity of the Board supplying
these data. Unfortunately, this precludes our publishing the mathematics
(geometry) question paper. However, its organization can be
revealed, as it is fairly typical.

There were three compulsory groups, with a choice of questions
within each group, as follows:

Group A: Answer 3 out of 6;
         each had 2 parts, carrying 4 marks each (total 8 x 3 = 24).

Group B: Answer 2 out of 4;
         each had 2 parts, one carrying 5 marks, the other 4 marks
         (total 9 x 2 = 18).

Group C: Answer 1 out of 3; only one part each (total 8).

Thus the candidate had to select a total of 6 questions, for a total of
50 marks.*

Table 14.5 gives the number of candidates at each ability level
choosing each question, and their average marks on that question.

It is unfortunate that Questions 12 and 13 had to be lumped into
the same column. The reason for this is simple. The Board's standard
answer book cover provides only for 12 questions. (Why are not separate
covers printed for separate subjects?) So the examiners had no choice
but to put marks for both Questions 12 and 13 in the last space. This
was not discovered by us until far too late for the very large job of sorting
through all one thousand answer books, and finding which candidates
had answered Question 12 and which Question 13.

* Though we were not looking for them, errors in marking sometimes turned up.
For example, at least one geometry candidate answered seven questions, instead
of the prescribed six, and received credit for all seven. Such errors do not occur
when all candidates are required to attempt all questions, as is recommended by
all experts in educational measurement.

Look at the last column, marked Total. These are the average marks 
of each group of 200 candidates. From the averages it is evident that 
the five groups are quite steeply graded in ability level, as reflected in 
total marks. 

Now look down the first column, i.e. the column of mean question
marks on Question 1. The top group had an average of 5.5 on Question 1.
The second group's average was 4.2, but the third’s was 4.9. Even though,
on the Total, the third group was 11 marks below the second group, it
did better than the higher group on Question 1. There are several other
such reversals: Questions 3, 5, and 6 show them between various groups.
These reversals produce negative validity, at least at these points:
relatively, they lower the marks of higher students, and raise those of
lower candidates, quite irrationally.
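Such reversals can be picked out mechanically from a column of group means. In the sketch below the first three means for Question 1 are those quoted in the text; the last two are hypothetical placeholders, since the text does not give them.

```python
def find_reversals(group_means):
    """Return index pairs (i, i+1) where a lower ability group outscored
    the group immediately above it. Input is ordered best group first."""
    return [
        (i, i + 1)
        for i in range(len(group_means) - 1)
        if group_means[i + 1] > group_means[i]
    ]

# Question 1: 5.5, 4.2, 4.9 are from the text; 3.0 and 1.5 are invented.
question_1 = [5.5, 4.2, 4.9, 3.0, 1.5]
print(find_reversals(question_1))  # [(1, 2)]
```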

Now look along the top row — or any other row. Notice how 
different the mean marks are for supposedly equivalent questions, 
answered by students of equal ability (as judged by total marks). These 
differences must, of course, be examined separately in each Group of
questions. Look at Group A (Questions 1-6) for the top group. A student 
who chooses Question 1 gets (on the average) 5.5 marks, while one who 
chooses Question 5 gets 7.9 marks — a bonus of 2.4 marks just for his 
choice! In Group B (Questions 7-10), the marks awarded Question 10 
are 4 marks out of 9 (44%) higher than those awarded Question 8, for 
this ability group. For other ability groups, some differences are larger, 
some smaller. 

Gayen (1961, pp. 51-53) has emphasized the fact that a candidate’s 
total marks are determined partly by his choice of questions. This can
be illustrated with, say, the second-scoring group (second row) of Table 
14.5. Compare the average marks of those who choose the easiest as 
against those who choose the hardest questions in each group. (In Group 
C we will have to treat Questions 12 and 13 as one question, because we 
lack separate data, as explained above.) 


        Easiest questions          Hardest questions
Group   Question    Marks          Question    Marks

  A        2         7.1              3         3.6
           5         8.0              4         4.2
           6         7.8              1         4.2
  B        9         4.1              8         2.1
          10         5.6              7         4.0
  C      12/13       5.4             11         4.1

Total               38.0                       22.2


This is nearly 16 marks out of 50 (32% of marks) difference just between
different combinations of questions. Among students of equal ability
(i.e. with the same average total marks), some could receive a III
Division and some a Distinction, based only on a lucky or unlucky
choice of questions! Obviously, a candidate’s Division quite often
depends on which questions he chooses, even more than on how
much he knows. If the reader cares to work it out for himself, he
will find that in every group candidates of equal ability can get different
marks just by happening to choose one or another set of questions. We
have argued elsewhere that the examiner who allows a choice of questions
is stating that he believes the questions to be of equal difficulty. Since
this is very obviously not true, and since the candidate has no way of
knowing which questions the examiners will mark “stiffly” and which
“easily”, the traditional essay type examination ends up as a sort of
guessing game. The student who is lucky enough to make the right
guesses gets higher marks than he deserves; the poor fellow who guesses
wrong suffers unjustly.
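The arithmetic behind the 16-mark gap can be laid out explicitly. The sketch adds up the mean marks of the second-scoring group for the easiest and for the hardest permitted combination of questions (3 from Group A, 2 from Group B, 1 from Group C), reading the tabulated entries as marks with one decimal place.

```python
# Mean marks of the second-scoring group, from the table above.
easiest = {"A": [7.1, 8.0, 7.8], "B": [4.1, 5.6], "C": [5.4]}
hardest = {"A": [3.6, 4.2, 4.2], "B": [2.1, 4.0], "C": [4.1]}

def expected_total(choice):
    """Sum of the mean marks over the six questions chosen."""
    return sum(sum(marks) for marks in choice.values())

lucky = expected_total(easiest)
unlucky = expected_total(hardest)
gap = lucky - unlucky
print(round(lucky, 1), round(unlucky, 1), round(gap, 1), round(100 * gap / 50, 1))
```

The gap of 15.8 marks is 31.6% of the 50-mark maximum, which the text rounds to 16 marks and 32%.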

There is universal agreement among experts in the science of
educational measurement that (except where necessitated by the use of
different courses or texts) there is no justification for allowing alternatives,
or a choice of questions, in any examination. That this choice
factor is a major source of unreliability and, more important, of invalidity
has been demonstrated by Harper (1962). The “chance” factor
in this type of examination can only be described as huge.

Another indication of inconsistency is the fact that Question 2 is
easier than 1 for the highest group, but for the lowest group it is the
other way around, with Question 1 receiving the higher marks. This
implies very different distributions of marks (wide and narrow) for
various questions. The scaling of marks for individual questions might
help a great deal in increasing validity of total marks, although, of
course, eliminating options is both easier and more effective.

It is unfortunate that circumstances did not permit a more detailed
tabulation of each part of each question separately. The results might
have been even more revealing. However, difference in difficulty in parts
is not so serious as differences between questions. This is because
differences between parts affect all candidates who choose the question
approximately equally. It is the differences between questions which
lower the reliability and validity of marks.

A more sophisticated method of item analysis might have revealed
these and other facts in more precise detail (see Gayen 1961). This was
not done, as this was not a central concern of this study. However, as a
side-light, the results seem sufficiently revealing of one more source of
weakness in the traditional type of examination to justify the effort
put into them.
THE MARKS of all examiners of a single subject should be scaled
to a common standard.

All examination marks in different subjects should also be scaled
to the same standard.


Scaling is the single most important change which should be introduced
into the present examination system now. And it can immediately
raise the reliability (in the broadest sense) of examination marks (Misra
1969, as well as this chapter), without any other change whatsoever.

It is also the single easiest change. It requires no change in the traditional
essay type questions. It requires no re-training of examiners. It
requires no fresh or special instructions to students. It requires no re-education
of the public. It requires only an administrative decision.

Scaling was recommended by Mahalanobis as early as 1934. It has
been repeatedly recommended, by a variety of writers and commissions,
since then. Surely forty years is a long enough “educational lag” so
that the idea should no longer be considered so radical that it should be
postponed, yet again, “for further study”.

And scaling is so simple that some form of it is used even by
elementary school teachers in countries like the U.S.A. Even American
school children understand and accept the idea of “grading on the curve”,
i.e. the application of the normal distribution curve to the problem of
marking examinations. The senior author has found both teachers and
students in India quite willing to understand and accept the idea, after
a simple explanation (see Harper, 1963a).
We have repeatedly mentioned the need for scaling of examination
marks, throughout this book. We will not bore the reader by quoting,
again, all the various types of evidence which point to this need. This
chapter will confine itself only to the highlights, and to a brief discussion
of some of the practical methods involved.


1. What is scaling?

Your friend says, “The temperature is 40° today.” You think,
“It should be very cold, but it is quite hot!” Then you remember
that you are old-fashioned and prefer to think of degrees Fahrenheit,
but your friend is talking about degrees Centigrade. So you decide to
convert 40° Centigrade into the equivalent number of degrees Fahrenheit.

How? You apply the formula (9/5)C + 32 = F, where F is the degrees
Fahrenheit, and C the degrees Centigrade. Here it is:
(9 x 40)/5 + 32 = 360/5 + 32 = 72 + 32 = 104°.
No wonder it seems hot: it is 104° today!

Perhaps you used to think that “a degree is a degree”. But obviously,
a degree Fahrenheit is different from a degree Centigrade. The meaning
of the word “degree” is different for the two scales. Similarly, the meaning
of the word “mark” is different for geometry and for history, and it is
usually even different for the “marks” Professor Ram Das awards in
history, and the “marks” Professor Mohan Lal awards in history.

“Scaling” is any process which makes the meaning of the word
“mark” the same for Prof. Ram Das and for Prof. Mohan Lal, and the
same for both history and geometry.

When different examiners of the same subject use different scales,
then a student’s Division may depend more on the “chance” factor of
who his examiner is, than on his actual merit. In the extreme, one student
may get a First while another student of equal merit is declared a Failure,
just because their answer books happen to be assigned to examiners who
use widely different scales. As we have seen repeatedly, each examiner
has, so to speak, his own “personal equation”. Scaling simply means
finding what this “personal equation” of each examiner is, and correcting
the marks which he has awarded, accordingly.

The effect of using different scales in different subjects is somewhat
different from the above. A “mark” of “70 out of 100” in mathematics
just does not mean the same thing as it does in, say, English (Government
of India, 1966, p. 201). In English, this mark means, “This is an
outstanding student.” In mathematics it means, “This student is above
average, but not really very brilliant.” (In some countries, in fact, a
mark of 70 in mathematics would be considered failing.)

The scaling methods applied to the marks of different examiners
of the same subject are similar to those which are applied to the marks
in different subjects. But let us discuss the former situation first. How
can we ensure that the marks of all examiners, marking the same
subject, will have the same meaning in terms of the actual merit of
the candidates?


2. Specified distributions of marks

The simplest method of scaling possible is the concept of “grading
on the curve”. This assumes that (unless there is strong evidence to the
contrary) each batch of answer books marked represents an average
group. Since abilities are assumed to be normally distributed, the marks
should be normally distributed. A certain proportion of them is,
therefore, expected to fall in each of the Classes or Divisions. If
approximately this proportion does not appear in each Division, then
the marks are re-adjusted accordingly.

To some extent, this is what every examiner does “in his head”.
The real problem is this: No authority has stated or fixed the proportion
that should fall in each Division, for a normal group. The result of this
is that each examiner has his own idea, and hence in practice examiners
use a wide variety of marking scales. As we have seen (Table 8.3), the
average marks awarded the same batch may range from 8 to 22, with
equally wide differences in standard deviation.

Specifying the distribution of marks expected for an average group,
and holding examiners to it, would tend to eliminate at least some of
the widest differences between the different scales used by different
examiners. And this can be done without any change at all in the present
examination system, except for the giving of more adequate instructions.

The instructions of the examining authority to its examiners might
be something like this:

“We expect that the average mark of all the candidates in this
examination will be approximately ____ marks. We expect that the
pass percentage will be approximately ____%, with a variation of not
more than 3% or 4% in either direction. We expect that approximately
the following percentage will fall in each Division: Distinction ____%,
First ____%, Second ____%, Third ____%, Fail ____%.

“After marking your answer books, re-arrange them in order of
merit. Then count the number in each Division, and calculate the
percentage in each Division and the Pass Percentage. If you feel that
the batch you have marked is an average or typical batch, then your
figures should be very close to the above. If they are not, please readjust
[...] so that they [...]. [...] to change the marks on individual
questions, only the [...].

“Some examiners, of course, may have received an extra good batch
or an extra weak batch. You still should tabulate your results and compare
them with the above-mentioned average distribution. Then ask yourself
seriously, ‘Is my batch really so much weaker than the average group
as to justify the amount of difference in the marks I have awarded them,
from the average distribution specified?’ Or, ‘Is my batch really as much
better than average as the marks I have given them seem to imply?’
In most cases, the examiner will find he has to readjust his marks upward
or downward. If his batch is extra weak or extra strong, the distribution
does not need to fit the one specified above. But it should not be too
different.”


Initially, instructions may have to be somewhat more detailed
than this. It is suggested that they be tried out on representative examiners
and then clarified as needed. A tabulation form can also be provided,
to facilitate understanding and application of the general principles.

The percentages in each category can be obtained by tabulating a
random sample of previous years' results.

It may be objected that the entire group of examinees may be stronger
or weaker one year than another. This may be true, but it is highly
doubtful that the differences are usually as much as the variations in pass
percentages seem to suggest. It is more likely that the variations in pass
percentages reflect variations in difficulty of questions, and variations
in the scales used by the examiners, at least as much as they reflect any
actual variations in achievement (Government of India, 1956, p. 291).

In any case, the flexibility allowed individual examiners in the above
instructions will still allow for a year’s entire batch being higher or lower
than that of a previous year.


3. “Raw” marks and “scaled” marks

The above will be facilitated if examiners can be taught to think
in terms of a two-stage marking process. This is not necessary, but it
will help. The first marks they assign should always be considered raw,
or unscaled, marks. This, then, frees them to assign these marks in any
way that they like. For example, some may find it simplest to mark every
question “out of 5”, and then to double the marks of those where the
maximum is 10. Much psychological research suggests that the most
reliable marking might be to assign a maximum of 7 marks to each
question, since seven seems to be the largest number of categories in which
human judgement can reliably differentiate anything complex. These
raw marks, however obtained, are then added up to produce a total raw
mark.

* There is a very human tendency, when faced with a major new idea, to
[...] because of some minor defect. A predictable number of persons [...]
this procedure, saying, “But it makes scrutiny impossible.” It does not. [...]
still be based on adding up the marks awarded each question [...]
raw mark is correct. Then we can check if the scaled mark assigned this raw mark
is the same as for other identical raw marks.

These total raw marks are then translated into scaled marks. Scaled
marks are the marks to be reported to the students, and it is these which
should fit the distribution of marks specified. If the examiner feels that
his whole batch is extra good or extra weak, then he can change the
specified distribution, e.g. decide that the average should be 50 rather
than 45, or that there should be 3% rather than 1% of Distinctions. Then
he assigns scaled marks to his raw marks, so that they will fit this
distribution.

The simplest method of converting raw to scaled marks is this: 
First decide what the highest scaled mark should be, and what the lowest 
scaled mark should be. The highest raw mark is then set equal to the 
highest scaled mark and the lowest raw mark to the lowest scaled mark. 
The remaining marks are distributed proportionately, in between. 
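
The proportional conversion just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `linear_scale`, the sample raw marks, and the chosen scaled limits of 25 and 75 are hypothetical and do not come from the study.

```python
def linear_scale(raw_marks, lowest_scaled, highest_scaled):
    """Map the lowest raw mark to lowest_scaled, the highest raw mark to
    highest_scaled, and spread the remaining marks proportionately."""
    lo, hi = min(raw_marks), max(raw_marks)
    if hi == lo:
        # All raw marks are equal: no spread to preserve, so use the midpoint.
        mid = (lowest_scaled + highest_scaled) / 2
        return [round(mid) for _ in raw_marks]
    step = (highest_scaled - lowest_scaled) / (hi - lo)
    return [round(lowest_scaled + (r - lo) * step) for r in raw_marks]


# Hypothetical batch of total raw marks, scaled to run from 25 to 75.
raw = [12, 20, 27, 35]
print(linear_scale(raw, 25, 75))
```

Because only the two end points are fixed, this method is sensitive to a single unusually high or low answer book, which is one reason a method based on the mean and standard deviation is generally preferable.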

A slightly better method has been proposed by Gulliksen (1950),
and adapted by Harper (1963 a) to Indian conditions. The senior author
and his colleagues have taught this to many Indian teachers, and found
that it is easily and quickly understood and accepted.

4. The role of the Head Examiner

The Head Examiner, or Dy. Head Examiner, re-examines a number 
of answer books from each examiner, chosen at random. 

First, let us consider the purpose of this re-examining. Firstly, it
seems to be to ensure that each examiner is actually marking the answer
books, and not just assigning numbers at random without regard to
merit. But this is a minor purpose: if examiners cannot be basically
trusted to do their work conscientiously, then we may as well throw out
the whole examination system. (As we have seen, there is ample evidence
in the studies reported in this book, that Indian examiners are at least
as careful and conscientious in their work as those of other countries.)
A second, and more important, purpose in the Head Examiner’s
re-examining seems to be to ensure a constant standard among examiners.
But we have frequently heard experts complain that the system really
does not do what it is supposed to do, i.e. equalize standards. (In the
terms we are using in this chapter, we may say that the system does not
in practice ensure that all examiners will use the same marking scale.)
Most research studies done in India (see Chapter V), as well as the two
reported in this book, amply justify this complaint.

Can the situation be improved? 

The Head Examiner (or Dy. Head Examiner) re-examines batches of
answer books chosen at random from the whole batch of the work of
each examiner. Why at random from the whole batch? A better procedure
would be as follows.

The Head Examiner selects a few answer books marked as First
Division by each of the examiners. He then compares these with each
other to see if the standard is constant. He does similar studies for
Distinction, Second Division, Third Division, and Fail. In each case,
he should select answer books with marks within, say, 5 marks of the
middle of the Division.

Psychologically, concentrating on a single level of ability at a time
is a much simpler task than trying to assess a large number of ability
levels, mixed up at random. Thus it can be done with a much higher
degree of reliability. In fact, the Head Examiner will be able to make
fairly precise statements as to exactly how much a particular examiner’s
mark of 50 should be lowered or raised to make it conform to the general
standard.

If the Head Examiner can wait to begin his work until all of the
examiners have finished their work, then he can re-adjust the marks of
each examiner himself according to his findings. (Or he can specify the
adjustments needed, and have a clerk or assistant do it.) If, as is more
common, the Head Examiner must examine answer books before
examiners have finished, and send them back for re-examining, then
we would suggest this: The Head Examiner specifies the number of
answer books in each Division to be sent to him. He may, for example,
say, “Each examiner is to send me 3 answer books with marks between
65 and 70, five with marks between 50 and 55, five with marks between
35 and 40, and five with marks between 20 and 25.” He then compares
those in each category with each other, and tells each examiner how much
(on the average) his marks in each category should be raised or
lowered.

The above methods will of course have to be adapted to the specific
traditions and administrative techniques and problems of each separate
examining authority. It should be evident, however, that the changes
required in present procedures are minimal. Yet even these minimal
changes can greatly reduce (though not eliminate) the wide variation
in standards of different examiners, variations which now make chance
(i.e. the chance of who the examiner happens to be) actually a greater
determiner of a student’s marks than is his true merit.

5. Randomization and equating

More sophisticated, mathematically based methods of equalizing
scales and standards will, of course, be even more effective than the
above method.

But to apply more mathematically precise methods, we must meet
one of two conditions:

(a) Either we must be able to assume that the distribution of marks
in the batch of answer books sent to each examiner is approximately the
same as the distribution sent to every other examiner;

(b) Or we must be able to define fairly precisely just exactly what
differences exist between the batches sent to different examiners.

Randomization. It is to meet condition (a) that Dr. H. J. Taylor
introduced at Gauhati (Taylor, 1963 c) the system of “randomization of
scripts”. All answer books for a region are collected in a central office.
The batches are then thoroughly mixed up. From these mixed-up answer
books, a certain number of answer books is drawn at random and sent to
each examiner. It is therefore assumed that the distribution of abilities
in each batch is approximately equal, i.e. that the mean and standard
deviation of the scaled marks of each batch should be identical.

This would seem to be the ideal procedure, for it solves several
other problems as well. It is virtually impossible for any candidate to
know where his answer book will eventually go. And no centre’s answer
books go to one examiner only.

However, there are admitted administrative complexities in this
system, and for a very large examination they may be overwhelming.
If condition (a) cannot be met, then condition (b) can. There are
several possible ways in which we can define fairly precisely just exactly
what differences exist between the batches sent to different examiners.
Head Examiner re-examines. One way grows out of the discussion
above, of the role of the Head Examiner. The results of his re-examining
can be reduced to statistical constants, and applied mathematically to
correcting the raw marks of each examiner. This would, then, lead to a
three-stage rather than a two-stage marking process: (i) examiner’s
raw marks, (ii) examiner’s scaled marks which he reports to the Head
Examiner, and (iii) Head Examiner’s final scaling of examiners’ marks.
(Possibly stage ii could be eliminated.)

Ordinarily, the Head Examiner re-examines in order to issue
instructions to the examiners. What we are suggesting here is that he not
issue instructions, but rather that he determine the “personal equation” of
each examiner and make the necessary corrections himself. The actual
corrections, of course, can be done by formula, applied by a statistical
clerk. (The “standard score” method, discussed later, would be the easiest
to use.)

All examiners examine a common batch. Harper (1967) has made
[...] would [...] have the Head Examiner [...] ten (or preferably, 20)
answer books, representing a wide range of [...] time otherwise needed to
examine them. The adequacy of the distribution of qualities [...] could be
estimated quickly by reading only [...]. [...] would then be made so that
each [...] would have a copy of each of these sample answer books. Either
the actual hand-writing, etc., could be [...], as we did, or the contents
could be typed. Each examiner would then mark [...] along with the rest
of his batch. The mean and distribution [...] the marks awarded these
sample answer books could then be [...] determine his “personal
[equation]”. They could be compared with the average mean and
distribution [...] examiners, and the results of all the answer books
marked [...] scaled to [...] common standard.

Use of a “link test”. A [...] brief test whose purpose is to assess the
level and distribution of [...] group, rather than to assess any individual
within it. Thus it need not have a reliability above .50 or so, because its
only purpose [...] together the scales and [...] examiners. The link test
may be a short objective test of intelligence or (preferably) [...] aptitude
test, or it may even be a [...] not more than 10 to 15 minutes of
examination time. (30 carefully chosen [...] are often enough.) All link
[tests are] sent for scoring [...] and not to individual examiners. The
results are not [...] individual student’s marks, [...] the range of marks
to be awarded to each batch [...] answer books.

After the link test [...], raw marks are [...] into scaled marks, so as
to give [...] proportion in each [...]. [...] handiest method is [...]
(Harper, 1963 a) [...] mean and standard deviation [...] can be made.
(If the total number of candidates is large, a sample of 500 link test
results would be adequate for this purpose of defining the [...].) [...]
The link test results of [Examiner A’s] batch [may have had a mean of]
50 and a standard deviation of 10. Then, whatever the raw marks are
that Examiner A hands in, they are re-scaled so as to have a mean of 50
and a standard deviation of 10. The link test results of Examiner B’s
batch may have had a mean of 47 and a standard deviation of 12. Then
Examiner B’s marks are re-scaled, so as to have a mean of 47, and a
standard deviation of 12.

This may sound complex, but actually it is quite simple, and could 
be done quite quickly 
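
To suggest how little work is involved, here is a minimal Python sketch of the re-scaling step for the two examiners in the example. The target means and standard deviations (50 and 10 for Examiner A, 47 and 12 for Examiner B) are the ones quoted above; the raw marks themselves, and the function name `rescale_to`, are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev


def rescale_to(raw_marks, target_mean, target_sd):
    """Re-scale a batch of raw marks so that the batch has the given
    mean and (population) standard deviation."""
    m, s = mean(raw_marks), pstdev(raw_marks)
    if s == 0:
        # Every answer book got the same raw mark: nothing to spread out.
        return [float(target_mean)] * len(raw_marks)
    return [target_mean + target_sd * (x - m) / s for x in raw_marks]


# Hypothetical raw marks handed in by the two examiners.
examiner_a_raw = [8, 12, 15, 19, 26]
examiner_b_raw = [30, 35, 41, 44, 50]

a_scaled = rescale_to(examiner_a_raw, 50, 10)  # link test: mean 50, SD 10
b_scaled = rescale_to(examiner_b_raw, 47, 12)  # link test: mean 47, SD 12

print([round(x) for x in a_scaled])
print([round(x) for x in b_scaled])
```

Whatever scale each examiner happened to use, his batch ends up with exactly the level and spread that the link test indicates for that batch.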

6. Scaling methods

All scaling methods require first that we specify the scale that we
intend to use. We may define this in terms of any two or more points
and, using the normal distribution curve, build our scale around those
points. We may define it in terms of the average score and the pass
percentage of an average group. We may define it in terms of the mean
and standard deviation of the group. We may define it in terms of the
percentage expected to fall in each Class or Division. We may even
define it in terms of the highest and the lowest score to be awarded any
particular group.

After defining or describing our scale, we must then decide on its
applicability to a particular group. If all the candidates are marked by
the same examiner, there is no problem: the scale is applied directly
to the examiner’s results. If there is a large number of examiners, but
the answer scripts have been randomized (as described in the last section),
again there is no problem. The scale can be applied directly to the raw
marks of each examiner. But if the system is such that different examiners
can receive batches of different ability, then we need to apply one of the
equating methods described above.

The next process is the actual conversion of raw to scaled marks.
There are two general types of methods. The first type is based on the
mean and standard deviation. Many readers will be familiar with
“standard scores”, though perhaps they have not realized that marks can
be converted to any arbitrary scale by these methods. The second type
is called “equi-percentile” or “normalized” scaling. Again, many readers
familiar with T-scores or stanines may not have realized that marks can
be converted to any arbitrary scale by similar methods.*

* It is interesting to note that both T-scores and stanines were originally defined as
normalized scales. McCall originally proposed the T-scale as a normalized scale
anchored to the performance of 12-year-olds (hence the “T”). The 12-year-old
standard has long since been dropped, and now there is a growing tendency to use the
term “T-scale” for any scale (normalized or standard deviation) with a mean of 50 and
a SD of 10. The term “stanine” is also now used by many for a standard deviation
scale (as well as the original normalized scale) with a mean of 5 and a SD of 2.

7. Standard deviation scaling

This is probably the most widely used scaling method. It has the
advantage that it is quick and objective. It is particularly suited to the
situations involving “equating” through common batches, link tests, or
the re-examining of small samples.

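The simple relationship on which standard deviation scaling is
based is this:

$$\frac{x - m}{\sigma} = \frac{X - M}{\Sigma} \qquad (15.1)$$

where x is a raw score (i.e. the mark given by a particular examiner),
m and σ are the mean and standard deviation of that examiner's raw
marks, X is the corresponding scaled mark, and M and Σ are the mean
and standard deviation of the agreed-upon common scale to which the raw
marks are to be converted. Applying simple algebraic transformations,
we can reduce the above to the form:

$$X = \frac{\Sigma}{\sigma}\,x + \left(M - \frac{\Sigma}{\sigma}\,m\right) = ax + b \qquad (15.2)$$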

A simple example will show how this is applied. We have earlier used
the results of Examiners 15 and 6 (of the ninety marking ten experiment)
to show the wide differences between unscaled marks of the same answer
books. We have arbitrarily decided to use Examiner 6’s scale as the
common scale, and to convert Examiner 15’s marks to Examiner 6’s scale.
(We could convert both of them to some other arbitrary scale just as
easily.) In this situation, x, m, and σ refer to Examiner 15’s marks, and
X, M, and Σ to Examiner 6’s marks. It will be remembered that their
marks have the following characteristics:

    m = 8.1       M = 22.2
    σ = 4.66      Σ = 6.52

Inserting these statistics in formula 15.2, we obtain:

$$X = \frac{6.52}{4.66}\,x + \left(22.2 - \frac{6.52}{4.66} \times 8.1\right) \approx 1.4x + 10.9$$

We can now convert each mark awarded by Examiner 15. The table
below first gives the original unscaled marks and the differences between
them. It then shows the scaled marks (note that Examiner 6’s are
unchanged), and the differences between them.



               Unscaled marks              Scaled marks
               of examiner                 of examiner
Candidate      15      6    Difference     15      6    Difference

A               7     23        16         21     23         2
B               5     15        10         18     15         3
C               8     27        19         22     27         5
D              20     34        14         39     34         5
E              11     29        18         26     29         3
F               6     26        20         19     26         7
G               2     16        14         14     16         2
H               7     17        10         21     17         4
I              10     22        12         25     22         3
J               5     13         8         18     13         5

Average difference             14.1                        3.9


Let us note, again, that both of these were experienced examiners,
and that they were marking exactly the same batch of ten answer books.
The average difference between their unscaled marks was 14.1. Scaling
reduced this to only 3.9. The smallest difference between unscaled marks
was 8, while the largest difference between scaled marks is only 7.

It is obvious that this simple algebraic process can greatly increase
the reliance that we can place on any set of marks.
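
For readers who wish to check the arithmetic, the worked example can be reproduced with a short Python sketch. The marks are those of Examiners 15 and 6 for candidates A to J as tabulated above; the variable and function names are only illustrative, and the population standard deviation (`pstdev`) is assumed because it reproduces the quoted values of 4.66 and 6.52.

```python
from statistics import mean, pstdev

# Unscaled marks for candidates A to J, from the table above.
examiner_15 = [7, 5, 8, 20, 11, 6, 2, 7, 10, 5]
examiner_6 = [23, 15, 27, 34, 29, 26, 16, 17, 22, 13]

# Raw-scale statistics (Examiner 15) and common-scale statistics (Examiner 6).
m, sigma = mean(examiner_15), pstdev(examiner_15)
M, Sigma = mean(examiner_6), pstdev(examiner_6)

# Formula 15.2: X = a*x + b, with a = Sigma/sigma and b = M - a*m.
a = Sigma / sigma
b = M - a * m
scaled_15 = [round(a * x + b) for x in examiner_15]


def mean_abs_diff(xs, ys):
    """Average absolute difference between two examiners' marks."""
    return mean(abs(x - y) for x, y in zip(xs, ys))


print(f"a = {a:.2f}, b = {b:.2f}")
print("scaled marks of Examiner 15:", scaled_15)
print("average difference before:", mean_abs_diff(examiner_15, examiner_6))
print("average difference after: ", mean_abs_diff(scaled_15, examiner_6))
```

The sketch scales only Examiner 15, since Examiner 6's scale was chosen as the common one; scaling both to a third scale would use the same two lines with different values of M and Σ.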

In practice, M and Σ can be defined arbitrarily (based on an analysis
of past examination marks). Or they may be defined by the marks of a
link test, or of a Head Examiner’s re-examining, separately for each batch.
Whatever the basis for the scale, the mechanics of converting raw to
scaled marks are obviously quite simple. In fact, it is not necessary to
apply the formula to each individual answer book. It is simple to set
up a table, listing raw marks in the first column and their equivalent
scaled marks in the second. (With a calculating machine, such a table
can be prepared in ten minutes or less.) Then the examiner, or a clerk,
looks up each candidate’s raw mark in the left-hand column, reads
off his scaled mark, and records it.
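
Such a conversion table might be prepared as in the following sketch. The constants a = 1.4 and b = 10.9 are the rounded values from the worked example of Examiner 15; the maximum raw mark of 50 and the function name `conversion_table` are assumptions made only for illustration.

```python
def conversion_table(a, b, max_raw):
    """Return a lookup table {raw mark: scaled mark} for X = a*x + b."""
    return {raw: round(a * raw + b) for raw in range(max_raw + 1)}


# Rounded constants from the worked example; max_raw = 50 is assumed.
table = conversion_table(1.4, 10.9, 50)

# The clerk simply looks up each candidate's raw mark.
for raw in (2, 7, 11, 20):
    print(raw, "->", table[raw])
```

Once the table exists, no further calculation is needed for any answer book in the batch.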

8. Why not just add or subtract a constant ? 

One common but crude method of “scaling” sometimes used in
India is just to adjust the mean of the marks. If an examiner’s mean is
above the group mean, the difference is subtracted from each of his marks;
if below, the difference is added. While better than nothing, this is quite
inadequate. Such an adjustment would be justified if all examiners used
the same distribution of marks. However, it is clear from our data that
the differences in distribution (technically called the standard deviation)
[...] (and to some extent even [...]). It is the standard deviations, not
the means, [...] which take account of both the mean and the standard
deviation are needed [...] them [...] the same scale.

The importance of scaling [...] is repeatedly evident in our data.
[...] this pair of mathematics examiners marking the same 50 answer
books. Their marks correlated highly, r = .92.

                    Mean     Standard Deviation
    [...]           11.03          8.45
    [...]           11.08          9.50

Partly because of this difference [...] who were placed in Division
[...] “spread” from Division I to Fail by Examiner [...] the means would
have made no difference at all, since the means were [...] identical for
the two.

Even more striking is [...] history:

                    Mean     Standard Deviation
    Examiner D      18.7           1.96
    Examiner C      18.9           5.02

[...] Examiner C [...], but the difference in the standard [...].
It is obvious that just [...] added or subtracted [...] difference in these
examples.

9. “Normalized” scaling methods

A big disadvantage of the standard deviation scaling methods is
this: The scaled marks follow the same distribution as the raw, or
unscaled, marks. This would be fine if our examiners [...] in objective
marking. Unfortunately, examiners are [...] there are often rather [...]
differences [...] used by different examiners, not only in mean and
standard deviation, but in the shape of the distribution. [...] the
“steps” in many examiners' scales are seldom all equal. For example, in
our ninety
marking ten experiment, we found “gaps” of several marks occurring
at different places for different examiners of the same ten answer books.
The difference between the highest failing candidate and the lowest
passing candidate was as high as 12 marks for some examiners, while it
was only 1 mark for others. A more common example is what Taylor
has called the J-effect: the tendency to pile up marks just above the
passing point. If we then scaled these by mean and standard deviation
methods, we would have the absurdity of a J occurring at, say, 41 marks,
or some other “meaningless” point. Finally, there is the tendency for
lenient examiners to award a disproportionately large number of high
marks, and for the stiff examiner to award a disproportionately large
number of low marks. While abilities may not be completely normally
distributed in some groups, it is unlikely that they are as skewed* as
some examiners make them. In other words, forcing a normal distribution
on the marks probably does less violence to the true distribution than
allowing examiners to impose their own psychological quirks.

All these problems can be avoided by having examiners report only 
rank orders and no marks at all. 

The “normalized” scaling methods force the marks into a normal
distribution curve. Thus the scaled marks no longer have the same
irregular distribution as the unscaled marks, which they would have
with standard deviation scaling methods. The normalizing methods
accomplish this by specifying that a certain percentage or proportion of
the group is to be awarded a particular mark. In stanine scaling, for
instance, the top 4% are awarded a mark of 9, the next 7% are awarded
a mark of 8, the next 12% a 7, etc. In the familiar T-scale, the proportion
awarded each score is smaller. Percentile ranks are another “normalized”
scale with which many teachers are familiar. The top 1% of the group
have a percentile rank of 99, the next 1% have a percentile rank of 98, etc.

The same thing can be accomplished just by deciding that, for a 
group of a given size, a rank (order of merit) of 1 is equivalent to a mark, 
say, of 80. A rank of 2 may be equivalent to a mark of 78. In the middle 
of the scale, the ranks will, of course, tend to bunch up. Ranks of 100, 
101, 102, 103, and 104 may all be equal to a mark of, say, 50. 

* This chapter is addressed both to the non-statistician and to the technical
expert. Therefore, some technical statements are unavoidable. The educational
administrator, teacher, or parent should just ignore these technical statements, as
they are not essential to his understanding of the general principles involved.

The marks equivalents of the various ranks can be decided on the
basis of tabulating past examination results. Or they may be set arbitrarily,
with reference to the normal distribution curve. (A statistician can
translate ranks into any score system, taking account of the fact that
there is always a wider difference between ranks 1 and 2 than between,
say, ranks 49 and 50, out of 123 ranks.)
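
One common way a statistician might make this translation is sketched below in Python. It is not necessarily the procedure used in this study: it assumes that each rank is placed at the midpoint of its share of the normal curve, and that the target scale has a mean of 50 and a standard deviation of 10 (the T-scale convention mentioned earlier). The function name `rank_to_mark` is hypothetical.

```python
from statistics import NormalDist


def rank_to_mark(rank, group_size, scale_mean=50.0, scale_sd=10.0):
    """Convert a rank (1 = best; half-ranks allowed for ties) into a
    normalized mark on a scale with the given mean and SD."""
    # Proportion of the group falling below the midpoint of this rank.
    p = (group_size - rank + 0.5) / group_size
    z = NormalDist().inv_cdf(p)  # standard normal deviate for that proportion
    return scale_mean + scale_sd * z


n = 123
gap_top = rank_to_mark(1, n) - rank_to_mark(2, n)
gap_mid = rank_to_mark(49, n) - rank_to_mark(50, n)
print(f"ranks 1 and 2 differ by   {gap_top:.1f} marks")
print(f"ranks 49 and 50 differ by {gap_mid:.1f} marks")
```

The output shows the point made in the text: one step in rank is worth several marks at the top of the order of merit, but only a fraction of a mark near the middle, where the ranks bunch up.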

The examiners, then, report only ranks or order of merit. They may,
of course, mark their answer books in any way that they wish, but in
the end they must put them in order of merit and just report this order.
They should also be restricted, say, to declaring not more than three
answer books as “tied” or of “equal rank”. The ranks are then converted
into marks in the Board’s office.


Rank order scaling is very easy to apply when the answer books
have been randomized. (In fact, it is probably easier than any other
method, including the adding or subtracting of constants to equalize
the means.) Where one of the “equating” methods has been used, the
process becomes a little more complex. A set of conversion tables will
have to be prepared, for use with batches with different means and
standard deviations. But any competent statistician can do this. He
only has to remember that rank orders are a non-linear scale (like the
more familiar percentile scale) which he is converting to scaled marks,
which are a linear scale.


10. The advantages of rank order scaling

Examiners who are asked to place answer books in rank order, or
order of merit, are asked to do a task which is far simpler for human
judgement than is the assigning of absolute marks. They may, of course,
begin by assigning some sort of marks. But they are relieved of the
necessity of going back and changing their marks, when they discover the
question is easier (or harder) than they thought, and therefore they have
been marking too high (or too low). They are also relieved of the agonizing
decision about a borderline candidate, “Should I pass him or should
I fail him?” They merely put him in his rank position, and let the statistics
make the decision. Again, in a situation of internal assessment, it is far
easier for a teacher to justify a particular rank assigned a student in his
class, than to answer the angry parent who says, “You could have given
him a First if you had wanted to.” Would not the parent be much less
likely to say, “You could have declared my son to be better than Ram Das,
if you had wanted to!”

In Chapter III, Section 5, and again in Chapter VIII, Section 5,
we examined the difference between marks and rank orders. We found
that examiners agree on ranking the best candidate, but they do not agree
on how many marks to award him. Thus if the examiners award only
rank orders, but the marks are awarded by the Board’s office, there will
be a much higher degree of reliability in the marks of the candidates.
The absurd situation in which the same candidate can be awarded a
Distinction by one examiner and a low Fail by another is likely to be
eliminated.

To demonstrate this, let us convert the rank orders in the ninety
marking ten experiment to marks. There are several methods by which
we could convert such rank orders to marks. We might do it on the basis
of some arbitrary marking scale. Or we might find the average mark which
each of the 90 examiners assigned his top candidate, the average mark
each assigned his Number 2, etc. But perhaps it is best to take our clue
from the “correct marks” of the various candidates. (Remember, of
course, that this is an unusual situation. We do not know “correct marks”
in any public examination. The specific type of conversion shown here,
therefore, would not be applicable in ordinary practice.)

Let us list the ranks and correct marks of the ten answer books
(see Table 8.6).

    Answer      Mean      Correct
    book        Rank      Mark

    D           1.2       27.0
    E           2.1       23.1
    C           3.3       19.0
    I           4.9       15.5
    F           4.9       15.4
    A           5.4       14.5
    B           7.7       11.2
    H           7.8       10.9
    G           8.6        9.8
    J           9.2        8.8


We could, of course, just assign Rank 1 a mark of 27, Rank 2 a mark
of 23, Rank 3 a mark of 19, etc. But for a more exact solution, we
draw a graph. On the ordinate we list the marks (from 7 to 28), and
on the abscissa the ranks 1 to 10. We plot our points, and connect them
with a curve. From this graph, then, we read off the following rounded
values:


Rank   Mark        Rank   Mark        Rank   Mark

1       28         4.5     16         8       11
1.5     26         5       15         8.5     10
2       24         5.5     15         9        9
2.5     22         6       14         9.5      8
3       20         6.5     13         10       8
3.5     19         7       12
4       17         7.5     11
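The graphical conversion can be approximated numerically. The sketch below is only an illustration: it joins the ten (mean rank, correct mark) points of Table 8.6 with straight segments instead of the hand-drawn curve, averages the two books that share mean rank 4.9, and extends the end segments outward for ranks below 1.2 or above 9.2. The function names are hypothetical, and the rounded values will not match the hand-read table everywhere.

```python
from bisect import bisect_right

# (mean rank, "correct" mark) for the ten answer books, from Table 8.6.
POINTS = [
    (1.2, 27.0), (2.1, 23.1), (3.3, 19.0), (4.9, 15.5), (4.9, 15.4),
    (5.4, 14.5), (7.7, 11.2), (7.8, 10.9), (8.6, 9.8), (9.2, 8.8),
]


def merge_ties(points):
    """Average the marks of points that share the same mean rank."""
    grouped = {}
    for rank, mark in points:
        grouped.setdefault(rank, []).append(mark)
    return sorted((r, sum(m) / len(m)) for r, m in grouped.items())


def rank_to_mark(rank, points=POINTS):
    """Piecewise-linear conversion; the end segments are extended outward."""
    pts = merge_ties(points)
    xs = [p[0] for p in pts]
    i = bisect_right(xs, rank) - 1
    i = max(0, min(i, len(pts) - 2))  # clamp to a valid segment
    (x0, y0), (x1, y1) = pts[i], pts[i + 1]
    return y0 + (y1 - y0) * (rank - x0) / (x1 - x0)


if __name__ == "__main__":
    # Print a conversion table for ranks 1, 1.5, ..., 10.
    for half_rank in range(2, 21):
        r = half_rank / 2
        print(r, round(rank_to_mark(r)))
```

As the text itself notes, a conversion anchored on "correct marks" is possible only in an experiment of this kind; in ordinary practice the anchor points would have to come from some other agreed scale.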


Table 3.4 may now be re-written, in terms of scaled marks rather
than of rank orders. (We have used the more detailed data in Table 8.5,
which included ties as half-ranks.) See Table 15.1.
TABLE 15.1

The Effect of Rank Order Scaling on Marks: Data as Shown in Table 3.1


Scaled 

Marks 

D 

E 

C 

Answer fii 
I F A 

70ks 

B 

H 

G S 

Total 

28 

70 

10 

1 


1 





-hr 

26 

7 

6 

1 


- 





.« 

■Eu 1 

pEH 

55 

11 

2 

1 





IRh 


m 

4 

3 

1 

— 





IvS 


B 

12 

43 

11 

8 

2 


1 


IBfl 


B 


6 

2 

3 

2 


— 


Bip 


1 

- 

11 

20 

18 

13 


— 


WWM 

16 

i 

2 

7 

8 

14 

12 

1 

— 


44 

15 


l 

5 

19 

24 

23 

3 

5 

1 

83 

14 



1 

13 

10 

19 

4 

6 

3 1 

57 

13 



- 

6 

3 

6 

7 

7 

3 — 

32 

12 



- 

1 

3 

6 

20 

10 

S 4 

49 

U 



1 

6 

3 

4 

28 

27 

23, 11 

103 

10 




1 

1 

J 

9 

10 

9 5 

36 

9 





— 


8 

l£ 

17 17 

58 

S 





1 


10 

4 

29 52 

100 

Total 

90 

90 

90 

90 

90 

90 

90 

90 

90 90 

900 












It is interesting to compare Table 15.1, the scaled marks, with
Table 3.1, the unscaled marks. Perhaps the simplest comparison is just
in the range of marks which are awarded each answer book. For
convenience we are reproducing below the lowest and highest mark, and
the range, for both the Unscaled and the Scaled marks.

          Unscaled         Scaled

D         17-35 = 19       19-28 = 10
E         11-38 = 28       13-28 = 14
C          8-30 = 23       11-28 = 18
I          7-26 = 20       10-24 = 15
F          6-26 = 21        8-28 = 21
A          7-23 = 17       10-20 = 11
B          5-17 = 13        8-16 =  9
H          3-19 = 17        8-20 = 13
G          2-17 = 16        8-15 =  8
J          2-15 = 14        8-14 =  7
Scaling of marks, by the rank order method, has reduced the range
of marks awarded each candidate by an average of one-third. With
unscaled marks, some candidates were placed in as many as five different
Divisions; now only four are placed in more than two Divisions, and
none in more than three. The extremely high marks have been eliminated,
and so have the extremely low marks. Obviously, many of the
idiosyncrasies of individual examiners have been greatly moderated or
even eliminated by this simple process. One can place much more reliance
on scaled than on unscaled marks. We have eliminated a major source
of difference between examiners. We have substantially reduced the very
large “chance” factor in traditional essay-type examinations.
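The "one-third" figure can be checked from the ranges listed in the comparison above. The short sketch below takes the ten unscaled and ten scaled ranges as read from that listing (books D, E, C, I, F, A, B, H, G, J in order) and computes the reduction in two ways; the variable names are illustrative only.

```python
# Ranges of marks for answer books D, E, C, I, F, A, B, H, G, J,
# as read from the unscaled/scaled comparison in the text.
unscaled = [19, 28, 23, 20, 21, 17, 13, 17, 16, 14]
scaled = [10, 14, 18, 15, 21, 11, 9, 13, 8, 7]


def mean(values):
    return sum(values) / len(values)


# Shrinkage of the total of all ten ranges.
overall_reduction = 1 - sum(scaled) / sum(unscaled)

# Average of the ten per-book shrinkages.
per_book = [1 - s / u for s, u in zip(scaled, unscaled)]
mean_reduction = mean(per_book)

print(f"overall: {overall_reduction:.2f}, per book: {mean_reduction:.2f}")
```

Both ways of averaging give a reduction close to one-third, in line with the statement in the text.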


11. Scaling of marks in different subjects to a common standard 

All that we have said concerning the scaling of marks of different
examiners of the same subject, also applies to the scaling of the marks
of different subjects to a common standard. The need for scaling
examination marks has been pointed out repeatedly (Mahalanobis 1934, Harper
1963, Taylor 1963). The data in this study repeatedly pointed to this need.

Perhaps the most striking example of the need for scaling was the 
difference in the mathematics scale and other scales, as has already been 
pointed out. This resulted in nearly as wide average differences between 
geometry examiners (in a supposedly “exact” subject) as between other 
examiners. It resulted in a much wider standard error of measurement, 
despite higher inherent reliability, for mathematics than for other subjects. 

As we have pointed out earlier, it is the standard deviation, not 
the mean or even the “maximum marks” which determines the weight 
a particular subject carries in the aggregate marks. Mathematics 
examiners, using a scale with a standard deviation twice that of any 
other subject, are thereby making mathematics twice as important as any 
other subject in determining a student’s rank in the aggregate marks. 

An example should make this clear. From Table 4.5, we can see
that the top 5 geometry candidates were awarded (in the Original marking)
50 marks out of 50. But the top history candidate was only awarded 36
marks, the second, third, and fourth 31 marks, and the fifth only 30 marks.
Let us make a table of some of the high ranks:

               Marks awarded in
               History    Geometry

Rank  1          36          50
      5          30          50
     10          28          49
     15          27          48
Suppose Candidate A ranks First in history and 15th in geometry.
His total marks are 84. Candidate B is obviously superior, for he ranks
First in geometry and only 10th in history. Yet his total marks are only
78. Because marks are not scaled, the inferior candidate receives the
higher total marks, just because he happens to be strong in geometry.
A similar example could be worked out at the bottom of the scale to
show that here the student who is weak in geometry is penalized more
than the student, of equal ability, whose weak subject happens to be
history.

Adding together marks of such widely differing scales is about
as rational as adding inches and centimeters, or degrees Fahrenheit and
degrees Centigrade.
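One common way of bringing two subjects to a common standard is a linear rescaling of each subject to the same mean and standard deviation before adding. The sketch below illustrates that idea only; it is not the rank-order procedure described earlier in this chapter. The target mean of 50, the target standard deviation of 10 and the six sets of marks are all hypothetical.

```python
from statistics import mean, pstdev


def scale_marks(marks, target_mean=50.0, target_sd=10.0):
    """Linearly rescale marks to the target mean and standard deviation."""
    m, sd = mean(marks), pstdev(marks)
    if sd == 0:
        raise ValueError("all marks are identical; cannot scale")
    return [target_mean + target_sd * (x - m) / sd for x in marks]


# Hypothetical raw marks (out of 50) for the same six candidates.
history = [36, 30, 28, 27, 22, 18]
geometry = [50, 50, 49, 43, 25, 10]

scaled_history = scale_marks(history)
scaled_geometry = scale_marks(geometry)

# After scaling, both subjects carry equal weight in the aggregate.
aggregate = [h + g for h, g in zip(scaled_history, scaled_geometry)]
print([round(a, 1) for a in aggregate])
```

Because both scaled subjects now have the same standard deviation, neither dominates the aggregate, which is the point made above about the standard deviation determining a subject's weight.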

Let us close this chapter by again reminding the reader: The scaling
of examinations requires no change in the current type of essay
examination; it requires no re-education of examiners; it requires no
fresh instructions to students; it requires no re-education of the public.
Since scaling is to be done in the Board's or the university's office, all
that is needed is an administrative decision to apply a modern and
thoroughly tested method, for the immediate improvement of the
reliability of examination marks. Taylor et al. (1966 b) put the matter in a
nutshell:

‘The notoriously large uncertainties of university examining arise,
not so much from the unreliability of the examiners, as from the
failure to [scale]. When scripts are divided amongst many examiners,
their marking standards must be brought to the same level
by scaling. If this is not done, the examiner who passes 15% of his
scripts is put on the same footing as the one who passes 85%, with
the effect that poorer candidates pass and better ones fail.’
Recommendations: Question-wise
Multiple Marking


ONE obvious method to increase the reliability of marks is
to have two or three (or more) examiners evaluate each answer
book. Since this doubles or triples the work involved, this is very
expensive.

A second, less obvious, method is to have each question evaluated
by a different examiner. Since this does not increase the number of
examiner-hours involved (each question is read only once), it adds little
or no expense to current methods.

The ninety marking ten data gave us an ideal basis for studying
these two methods.

Not only is the second method cheaper; it is also better.
These studies strongly support the contention that the reliability
of the marks can be increased substantially, if each candidate’s work is
marked by several different examiners, each examiner marking only
one question.

1. Examiners who agree are likely to be wrong

First let us discuss a practice which is widely used in higher degree
examinations. Two examiners are appointed, and they examine every
answer book. The answer books on which their marks agree are taken to
be correctly marked. Only the answer books on which they disagree are
submitted to a third examiner.

This seems to be, of course, “common sense”. Unfortunately,
“common sense” is often scientifically incorrect. Dr H. J. Taylor and
Mr L. N. Tluanga have written a booklet entitled The Problem of the
Third Examiner (1964). They point out, and prove mathematically,
that when examiners disagree their average is more likely to be correct
than when they agree!

For those who may not be able to follow Taylor and Tluanga’s
mathematical argument, the following very simplified version may help.
Suppose we have two examiners, both of equal ability and experience.
Examiner A marks one-third of his papers too high (let us call this
+1). He marks one-third too low (let us call it -1). The remaining
third are marked approximately correctly (let us call this 0, for “zero
error”). Examiner B marks with the same degree of error.

Examiners A and B mark the answer books independently. This
means that each type of mark by Examiner A has an equal chance of
being paired with each type of mark of Examiner B. Thus, +1 for
Examiner A is paired with +1, 0, and -1 for Examiner B. Zero (0)
for A is paired with +1, 0, and -1 for B. A’s -1 is paired, successively,
with B’s +1, 0, and -1. Thus there are nine possible ways in which
Examiners A and B can mark a given set of answer books.

Now let us arrange these 9 ways in order so as to bring out the
salient points:

     Examiner          Average
     A        B        Error

    +1       +1        +1.0
     0        0          0
    -1       -1        -1.0

    +1       -1          0
    -1       +1          0

    +1        0        +.5
     0       -1        -.5
     0       +1        +.5
    -1        0        -.5


The two examiners agree in only the first 3 out of 9 cases. Of these
three averages, two are quite wrong, one being too high (+1.0) and
the other being too low (-1.0).

The two examiners disagree widely in 2 out of the 9 cases, and in
both these cases, the average is exactly right.

In four cases, the two examiners disagree slightly. In all four, the
average is only slightly wrong.

When the examiners disagree slightly, the average is closer to the 
“correct mark” than the average is in two out of the three complete 
agreements. 

SUMMARY: When two examiners award different marks, the 
average is more likely to be correct, or nearly correct, than it is when they 
award the same mark. 
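The nine-case argument can be enumerated mechanically. The sketch below assumes exactly the simplified model of the text (each examiner's error is +1, 0 or -1 with equal chance, independently of the other) and groups the nine pairs by the size of the disagreement; the names are illustrative.

```python
from collections import defaultdict
from itertools import product

ERRORS = (+1, 0, -1)  # marked too high, correctly, too low

# Group the nine equally likely (A, B) pairs by how far apart the marks are.
groups = defaultdict(list)
for a, b in product(ERRORS, repeat=2):
    gap = abs(a - b)  # 0 = agree, 1 = slight, 2 = wide disagreement
    groups[gap].append((a + b) / 2)  # error of the averaged mark


def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


summary = {gap: (len(avgs), mean_abs(avgs)) for gap, avgs in groups.items()}

for gap in sorted(summary):
    count, err = summary[gap]
    print(f"gap {gap}: {count} cases, mean absolute error of average {err:.2f}")
```

In this model the average error of the averaged mark is largest when the two examiners agree and zero when they disagree widely, which is the point of the SUMMARY above.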

CONCLUSION: There is no justification for the practice of assigning
a third examiner to only those answer books where the two marks
given a candidate differ by 5, 10, or 15 points. If anything, the third 
examiner should be appointed when the two previous examiners agree. 

RECOMMENDATION: We have been talking about two experi- 
enced examiners. And we have been talking in terms of simple “agree- 
ment” and “disagreement”, without taking into account how much 
they disagree. 

It is certainly true that two good examiners will disagree with each 
other by fewer marks, on the average, than will two poor examiners. 
Two poor examiners may have several disagreements of 10 marks or 
more— two good examiners may have no disagreements of 10 marks 
or more. 

It is therefore recommended that: If the two examiners disagree 
by 10% marks or more in more than one-tenth of the answer books , then 
the entire batch of answer books (those on which they agree as well as 
those on which they disagree) should be sent to a third examiner. The 
third examiner should mark the answer books independently of the 
first two, and the final mark should be the average of the three examiners. 

Note that Vernon (1940) has recommended that there should always 
be three examiners, and the data given in this paper support this view. 
In India, however, this may not be economically feasible. The method 
proposed allows the advantages of three examiners where necessary, 
while requiring only two when a third is not really needed. 

2. How many examiners are needed ? 

By drawing at random from our ninety marking ten data, we can 
test the effect of having the marks of one, two, three, etc. up to ten 
examiners averaged. We have the advantage that we already know the 
correct mark of each answer book. This is the average mark awarded 
by 90 experienced examiners. Therefore, we can compare the average 
marks of two, three, etc. examiners with the correct mark. This tells us 
how much error remains in the average. 
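The procedure just described amounts to a running average compared with a fixed correct mark. A minimal sketch follows; the function name is hypothetical, and the example uses the first two marks drawn for Answer Book F (19 and 12) together with its correct mark of 15.4, both of which are quoted elsewhere in this chapter.

```python
def running_errors(marks, correct_mark):
    """Error of the cumulative average after each successive examiner."""
    errors, total = [], 0.0
    for n, mark in enumerate(marks, start=1):
        total += mark
        errors.append(total / n - correct_mark)
    return errors


# Answer Book F: the first two examiners drawn awarded 19 and 12 marks.
errors_f = running_errors([19, 12], correct_mark=15.4)
print([round(e, 1) for e in errors_f])
```

The first examiner alone is 3.6 marks too high, while the average of the two examiners is within 0.1 of the correct mark.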

Table 16.1 presents this study. Answer book A had a correct mark
(the average of the 90 examiners) of 14.5. At random, the first examiner
drawn for this answer book was Examiner 80. He awarded the answer
book 10 marks. His error, therefore, was 10 - 14.5 = -4.5. The
second examiner drawn at random was Examiner 32.

TABLE 16 ! 

A r R<OE cr N mTrkTm”°n'oV 90 Cr Each of 

Correct Mark ( History Answer Books — 


Candidate 


a m i n e r 


1st 2nd 
A Examiner (83) (32) 
Mirk 10 0 

Average *0® 

Error — ( 5 — » 5 

D Exinvr (93) BH 
Mark 9 » 

Average » £ 

Error — 22 — ^ 

C Examiner (20) (16) 

Mark 22 g 0 
Average " “ 

Error 3 0 3 u 

D Examiner (66) (63) 

Mark 30 22 

Average _ 2 ?q 

Error 30 
E Examiner (43) BJ* 
Mark 25 » 

SK* !•“« 

F Examiner (|4> (55) 
Mark > 9 , 

Average , '« . 

Error 3 6 o 
G Examiner (87) (56) 

Mark 1 l « 

Average , , 

Error —2 8— 2 -s 
II Examiner (29) (52) 
Mark >° jfo 

Average , 

I Examiner (20) (88) 

Mark 25 j* 

Average . . 5 < 

Error 9 5 M 

j Examiner (43) w 
Mark Q 


3rd 4lh 5th 6th 7th 


9th 10th 


Mark 


fUUUtr 

g. & -«•? h -! s 

i imm- 
iihliw- 

ill I II I f*- 

Uiinn- 


If you look over Table 16.1, you will find several examples of
Taylor and Tluanga's thesis. Where examiners agree (like Examiners 80
and 32 for Answer Book A) their average is frequently quite wrong.
But look, for example, at the first two examiners of Answer Book F.
They disagree widely (19 and 12), but their average is almost exactly
correct. In general, when examiners disagree their average is more likely
to be correct than when they agree. Remember that the examiner
reliability for these data was .83.

Table 16.2 summarizes the results of our multiple marking. It
shows the number of times errors of each size occurred when the results
of two, three, four, up to ten examiners were averaged. The bottom row
shows the average error. The average error for a single examiner was
3.2. When we used two examiners, we cut this error by about one-third.
Three examiners cut it by one-half. But it required averaging the results
of ten examiners, to cut the error from 3.2 to 1.0. It required at least five
to reduce the error to 1.3, on the average.
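These figures are close to what elementary sampling theory would suggest. Assuming (and this is an assumption, not something tested in the study) that examiners' errors are independent and of similar spread, the typical error of an average of n examiners should fall roughly as 1/sqrt(n). The sketch below compares that rule of thumb with the observed averages for two, three, five and ten examiners.

```python
import math


def expected_error(single_error, n):
    """Rule-of-thumb error of an average of n independent examiners."""
    return single_error / math.sqrt(n)


single_error = 3.2  # average error of one examiner
observed = {2: 2.1, 3: 1.5, 5: 1.3, 10: 1.0}  # bottom row of Table 16.2

for n, obs in observed.items():
    rule = expected_error(single_error, n)
    print(f"{n:2d} examiners: observed {obs:.1f}, 1/sqrt(n) rule {rule:.2f}")
```

The agreement is rough, as one would expect from only ten answer books, but it explains why each additional examiner buys less improvement than the one before.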

TABLE 16.2

Summary of Table 16.1 Showing the Number of Times a Difference of
Each Size Occurred, on Each Re-examining, Between the Average
of Two to Ten Examiners and the Correct Mark. (Note: Differences
Tabulated without Regard to Sign. Marks are Out of 50.)

Size of difference                     Examiner                          Total
from correct mark     1    2    3    4    5    6    7    8    9   10   frequency

9-9.9                 1    -    -    -    -    -    -    -    -    -       1
8-8.9                 -    -    -    -    -    -    -    -    -    -       -
7-7.9                 -    -    -    -    -    -    -    -    -    -       -
6-6.9                 -    -    -    -    -    -    -    -    -    -       -
5-5.9                 -    1    -    -    -    -    -    -    -    -       1
4-4.9                 1    1    -    -    -    -    -    -    -    -       2
3-3.9                 3    1    -    3    -    1    1    1    -    -      10
2-2.9                 2    2    3    4    2    2    2    2    3    -      22
1-1.9                 1    2    3    -    5    3    3    3    1    5      26
0-0.9                 2    3    4    3    3    4    4    4    6    5      38

Average Error       3.2  2.1  1.5  2.0  1.3  1.4  1.4  1.3  1.3  1.0



3. Each examiner marks a different question

It is doubtful that any Board will appoint ten examiners for each
answer book, or even five. If it takes, say, half an hour to mark an
answer book, then five examiners would take two and a half hours,
increasing the cost of marking five-fold.

But suppose, instead, each examiner reads a different question.*
Then we have five examiners, yet the total time taken in examining an
answer book remains only half an hour.

Harper (1963 c) has proposed that the candidates answer each
question in a separate answer book. Each examiner, then, marks only
one question. (The article cited gives many practical suggestions as to
how this might be carried out, as well as theoretical discussions of its
value.)

The ninety marking ten experiment provides data to test Harper's
proposal. At random we drew Examiner 14. We assigned him Question 1.
Thus the marks which Examiner 14 had assigned to Question 1 in each
answer book were written down as that answer book's marks for
Question 1. Then at random we drew Examiner 24 and assigned him
the second question. The marks he had assigned that question in each
answer book were recorded as that answer book's marks for that question.
Similarly, Examiners 36, 41, and 48 were assigned the remaining three
questions. Thus, Answer Book A’s total mark is made up of the mark
Examiner 14 assigned to it on the first question, plus the mark Examiner
24 assigned to it on the second question, etc.
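The construction of these composite marks can be sketched as follows. The data layout (a dictionary of examiner number to answer book to list of question marks), the function name and the tiny example data are all hypothetical; only the procedure, one randomly drawn examiner per question, follows the text.

```python
import random


def question_wise_totals(marks, n_questions, rng=random):
    """Build composite totals from marks[examiner][book][question].

    One examiner is drawn at random (without replacement) for each
    question, and each book's total is the sum of those examiners'
    marks on their own question.
    """
    examiners = rng.sample(sorted(marks), n_questions)
    books = sorted(next(iter(marks.values())))
    totals = {
        book: sum(marks[ex][book][q] for q, ex in enumerate(examiners))
        for book in books
    }
    return examiners, totals


# Hypothetical data: three examiners, two answer books, two questions.
example = {
    14: {"A": [3, 4], "B": [2, 1]},
    24: {"A": [2, 5], "B": [3, 2]},
    36: {"A": [4, 3], "B": [1, 1]},
}
chosen, totals = question_wise_totals(example, 2, random.Random(1))
print(chosen, totals)
```

Each question is still read only once, so the examiner-hours are the same as in ordinary single marking.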

We repeated this experiment four different times, each time drawing
5 different examiners and assigning a single question to each of them.
Table 16.3 shows the results of these four experiments. For example,
Answer Book A's correct mark is 14.5. The total of the first five
examiners’ marks was 15.5, an error of +1.0. The total of the second five
examiners’ marks was 10.5, an error of -4.0. The total of the third
five examiners’ marks was 12.5, an error of -2.0. The last set of five
examiners awarded a total mark of 16.5, an error of +2.0.

It is interesting to compare these results with those in Table 16.2.
First we may tabulate the errors of different sizes for ease in comparison:

Size of error     Frequency

4-4.9                 3
3-3.9                 2
2-2.9                 7
1-1.9                15
0-0.9                13

Total                40


* The Vice-Chancellor of Kanpur University, Mr Radha Krishna, proposed,
in a meeting at which the senior author was present, a similar but in some
ways simpler [...] advantage that it could be applied immediately without
any change in present answer book procedures.
About one-third of these are errors of less than one point, which is
comparable to the results achieved with four or five examiners reading
each entire answer book. In general, the above distribution is slightly
better than when two examiners read each entire answer book, i.e.,
when twice as much work is done as in Harper's method.

The bottom line of Table 16.3 gives the average error for each of 
the four experiments. (We might, of course, have done many more such 
experiments — but the average errors were so close, that it did not seem 
necessary to replicate further.) These average errors range from 1.37 to 
1.69. All of them are lower than the average error of a pair of examiners 
in Table 16.2; and half of them are lower than the average for three or 
four examiners. 

It is evident that it takes at least three examiners reading each entire
answer book, to get results which are more reliable than those obtained 
by having each question read only once but by a separate examiner. 
Since the latter method costs only half as much (in time and examiner’s 
fees) as having two examiners evaluate each entire answer book, it is 
definitely the preferred method. Even if we can afford to have three 
examiners evaluate each entire answer book, it is still questionable if 
the slight gain in accuracy would be worth the extra cost. 

TABLE 16.3

Marks of Candidates Compared with Correct Marks when a Different
Examiner Marks each of the Five Questions, Each Examiner Marking
the Same Question in All 10 Answer Books. (Note: All Candidates
Answered the Same 5 Questions.)

            Four sets of 5 examiners chosen at random
              First         Second        Third         Fourth       Correct
Candidate   Mark  Error   Mark  Error   Mark  Error   Mark  Error    Mark

A           15.5   1.0    10.5  -4.0    12.5  -2.0    16.5   2.0     14.5
B           10.5   -.7    10    -1.2    10    -1.2    10    -1.2     11.2
C           20     1.0    19      0     16    -3.0    20     1.0     19.0
D           24.5  -2.5    28.5   1.5    26.5   -.5    28     1.0     27.0
E           23.5    .4    23.5    .4    22.5   -.6    23     -.1     23.1
F           16.5   1.1    12.5  -2.9    14.5   -.9    16      .6     15.4
G           10.5    .7     8.5  -1.3     5.5  -4.3     8    -1.8      9.8
H            6.5  -4.4     9    -1.9    10.5   -.4     7.5  -3.4     10.9
I           16.5   1.0    13    -2.5    15.5    0     13    -2.5     15.5
J            7    -1.8    10     1.2     8     -.8     6.5  -2.3      8.8

Mean               1.46          1.69          1.37          1.59
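The bottom row of Table 16.3 and the tabulation of error sizes can both be recomputed from the marks in the table. The sketch below does so, taking the "Mean" row to be the mean of the errors without regard to sign; the marks are listed for candidates A to J in order, and the variable names are illustrative.

```python
from collections import Counter

# Correct marks and the four composite marks per candidate (Table 16.3).
correct = [14.5, 11.2, 19.0, 27.0, 23.1, 15.4, 9.8, 10.9, 15.5, 8.8]
sets = {
    "First": [15.5, 10.5, 20, 24.5, 23.5, 16.5, 10.5, 6.5, 16.5, 7],
    "Second": [10.5, 10, 19, 28.5, 23.5, 12.5, 8.5, 9, 13, 10],
    "Third": [12.5, 10, 16, 26.5, 22.5, 14.5, 5.5, 10.5, 15.5, 8],
    "Fourth": [16.5, 10, 20, 28, 23, 16, 8, 7.5, 13, 6.5],
}

mean_abs_error = {}
size_counts = Counter()
for name, marks in sets.items():
    errors = [round(m - c, 1) for m, c in zip(marks, correct)]
    mean_abs_error[name] = sum(abs(e) for e in errors) / len(errors)
    # Bin each error by its whole-number part: 0-0.9, 1-1.9, and so on.
    size_counts.update(int(abs(e)) for e in errors)

for name, value in mean_abs_error.items():
    print(f"{name}: {value:.2f}")
print(dict(sorted(size_counts.items(), reverse=True)))
```

The recomputed means agree with the printed 1.46, 1.69, 1.37 and 1.59, and the error-size counts agree with the tabulation of 40 errors given earlier in this section.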
Thus it is both cheaper and better to have each question read
by a separate examiner. The results of each candidate are then based
on as many examiners as there are questions. Apparently at least some
of the examiners' individual idiosyncrasies can be averaged out by
this simple method.
The last two chapters have dealt with specific recommendations
which would improve marks without any change whatsoever in the present 
pattern of examinations. Each of them would require only an administra- 
tive decision, to introduce scaling, and to introduce question-wise 
marking of answer books. 

But with examination reform “in the air”, it is hoped that we will 
go much further. 

Twenty-two years ago the University Commission Report (1950),
commonly known as the Radhakrishnan Commission, pointed out that
the traditional examination is “invalid . . . inadequate . . . subjective
and therefore not reliable” and recommended (p. 329) “the introduction
of . . . objective examinations in the universities of India at the earliest
possible time”. Perhaps the first step is for someone to decide what
“the earliest possible time” is.

But short of such radical change, there is much that can be done
to improve the more familiar type of examination. One is tempted to
discuss the many proposals made: more freedom to paper-setters to
determine the most valid method of assessment (instead of being required
always to follow past patterns); better questions; better training of
examiners; use of multiple forms of an examination to frustrate copying;
more experimentation with “open book” examinations; use of
machines (and even carbon-backed forms!) to free examiners from the
unnecessary paper work which reduces reliability, because it reduces
the time available for examining; etc., etc., etc. But we will confine
ourselves to six recommendations which arise directly out of these two
studies.

1. Six recommendations

1. Eliminate options. The data show clearly that if no choice
of questions is allowed, then both examiner and content reliability,
and hence marks reliability, are increased significantly. Options must
be allowed when there are alternative textbooks, as in languages, of
course. Otherwise, as all experts agree, they are indefensible.

2. The elimination of options will be more acceptable if the number
of questions is greatly increased. The traditional, long essay-type question
undoubtedly elicits, i.e. requires for its answering, such skills as clear
thinking, organization, creativity, style, etc. But whether it actually
measures these skills to any reliable degree is highly doubtful. There is
probably nothing valuable measured by a long answer that cannot be
measured just as well by a one- or two-page answer. It is possible to set
as many as 20 one-page questions in a three-hour examination. The
present “short notes” type of question provides a precedent, though
the form needs to be greatly improved, dropping the present emphasis
on factual memory, and substituting more questions that say “compare”,
“contrast”, “what is the difference between”, “discuss the relationship
of ___ to ___”, “outline the essential features of ___”, etc. Many of the
present “short notes” questions could actually be answered quite
adequately in a sentence or two, which suggests a liberal use of even
shorter answers. “My experience with Air University demonstrated that
good questions requiring answers of one or two sentences in length
showed as high reliability and higher item validity coefficients than items
in objective tests given under the same time limit, even though the objective
tests contained about 50%-75% more questions.” (Warren G. Findley,
personal communication)

3. Even with two-page, one-page, or one-sentence answers,
efforts must be made to improve marking instructions, and training of
examiners, so that inter-examiner reliability can be raised to the .90
which Gulliksen (1950) recommends as minimal. This means nearly
doubling in real terms (i.e. in terms of Fisher's z) most of the examiner
reliabilities found in this study. Better marking instructions is the minimum
first step. But training of examiners is also needed. If any examiner
feels he needs no training, let him do the experiment of marking the same
100 answer books twice. If the correlation coefficient is less than .90,
he has lost his case.

4. We have seen that many examinees are falsely labelled “failed”,
when actually it is the examination system which has failed. Many of
these injustices can be eliminated, by using the grace marks system proposed
by Taylor and Tluanga (1963 a). Unlike the traditional rather arbitrary
methods, Taylor’s system is based on the reality of examiner (and test)
error, and therefore makes a rational (rather than purely arbitrary)
and fully fair correction. (See Appendix B for a version of Taylor’s
paper which may be somewhat more comprehensible to the
non-mathematician than the original.) Incidentally, Taylor and Tluanga’s tables
are based on the standard error of marking. But tables based on the
much wider standard error of measurement could easily be derived.

5. Routine statistics should be calculated after every examination.
At the very least, a Board should know the mean, standard deviation,
percentage in each Division, and content reliability of every examination
which it conducts. Probably some sort of item analysis should also be
carried out. And if possible, a few answer books should be re-examined
to determine examiner reliability, marks reliability, and measurement
error. The relative uniformity of the results in our studies suggests that
these statistics can be derived quite adequately from a rather small sample
of scripts in each subject, say 500 at the most. This is particularly true
if a stratified random selection system is used: first selecting a specified
number of examiners at random, and then selecting a specified number
of answer books at random, from the work of each.

6. Finally, if examination reform is ever to become a reality,
widespread publicity and education of both teachers and the general public
is needed. Talented journalists should be induced to study the many findings
of research, and to reduce them to simple articles for the popular press.
It is a mistake to assume that the scientists and educators who are
competent to do the research are necessarily the best ones to explain it
to the public. Despite extensive research, there is still an appalling lack
of knowledge and understanding even among teachers and educational
administrators, as to the extent of examination unreliability, the reasons
for it, and the well developed technology already available to correct
it. The task is an educational one, and in dealing with the layman it is
often the competent journalist who is really the best educator.
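The third recommendation speaks of "doubling in terms of Fisher's z". Fisher's transformation is z = atanh(r), and it stretches the scale near 1, so that a gain from .60 to .90 is far larger than it looks. The sketch below is only an illustration of that arithmetic: it finds the reliability whose z is exactly half that of .90, and also converts the .83 examiner reliability quoted for the ninety marking ten data.

```python
import math


def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)


def from_fisher_z(z):
    """Inverse transformation: back from z to a correlation."""
    return math.tanh(z)


z_target = fisher_z(0.90)                  # z for the recommended minimum
half_way = from_fisher_z(z_target / 2)     # reliability with half that z
z_study = fisher_z(0.83)                   # z for a reliability of .83

print(f"z(.90) = {z_target:.2f}, half of it corresponds to r = {half_way:.2f}")
print(f"z(.83) = {z_study:.2f}")
```

On this scale a reliability of about .63 has half the z of .90, which gives a sense of the size of the improvement being asked for.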

2. In conclusion

The question raised by the Secretary of the Board of Secondary
Education, who provided the answer books and examiners' addresses
for these studies, was this: “Are the variations within permissible limits?”

What are the “permissible limits”?

The psychometrician would say that the “permissible limits” are
the greatest degree of precision which can be obtained, within the goals,
aims, purposes, and financial resources of any particular examination.

Are there any values in the present methods of constructing,
administering, marking, and scaling examinations which are great enough
to justify the almost complete lack of precision in the end product, the
marks assigned the student?

This is a question that cannot be answered by the psychometrician.
It is a question which can only be answered by the educational authorities
in charge of the system.

The Education Commission (Government of India 1966, pp. 290-291)
hit the nail on the thumb, so to speak, when they said,

“what is lacking is not knowledge, but
will, courage and perseverance”
Abstract and example from Harper's “Internal-consistency Reliability
of Traditional-type Examinations when a Choice of Questions is
Allowed”

Ebel's intra-class correlation (see Guilford, Psychometric Methods,
2nd Ed., pp. 383-385) uses analysis of variance to derive the reliability
of a set of ratings. This same approach can be used for essay-type
examinations when no choice of questions is allowed, and is equivalent to a
generalized form of Kuder-Richardson formula 20. The analysis of variance
may be reduced to an algebraic formula as follows:

$$r_{tt} = 1 - \frac{kr\,\Sigma X^2 + (\Sigma X)^2 - r\,\Sigma(\Sigma X_r)^2 - k\,\Sigma(\Sigma X_c)^2}{(k-1)\,\left[\,r\,\Sigma(\Sigma X_r)^2 - (\Sigma X)^2\,\right]}$$

where  r = number of rows, i.e. the total number of examinees in
           the table
       k = number of columns, i.e. the number of questions to be
           answered by each examinee
       X = the mark on any single question
       ΣX² = the sum of the squares of all the Xs. (Each X must be
           squared and all squares added.)
       ΣX = the sum of all the Xs
       (ΣX)² = the square of the sum of all Xs. (Be sure to distinguish
           this from ΣX². The latter requires that each X be squared
           first, and all these squares summed; the former that
           all Xs first be added and the sum be squared.)
       Σ(ΣX_r)² = the sum of the squares of the row sums. (The sum of each
           row, i.e. the total mark of each candidate, is squared,
           and these squares are added.)
       Σ(ΣX_c)² = the sum of the squares of the column sums. (The total of
           all the marks awarded each question is squared, and
           these squares are added.)





Harper’s paper adapts this to several types of cases where a choice
of questions is allowed in a traditional-type examination. In the simplest
case, the candidate has a free choice of questions (e.g. “answer any 5
out of the 9 questions”), and all questions carry equal marks. It is shown
that in this case the examiner assumes that all questions are
interchangeable, and therefore are of equal difficulty; and therefore the formula
reduces to the following form:

$$r_{tt} = 1 - \frac{kr\,\Sigma X^2 - r\,\Sigma(\Sigma X_r)^2}{(k-1)\,\left[\,r\,\Sigma(\Sigma X_r)^2 - (\Sigma X)^2\,\right]}$$

The following example is designed both to illustrate the method, and 
to show the effects of choice on essay-type examination reliability. The 
difference in item difficulty is deliberately exaggerated to demonstrate these 
effects. 


Let us assume there are six candidates, A, B, C, . . F. The examina- 
tion has 7 questions, out of which each candidate is to answer 4 only. 
The questions differ in difficulty. Let us further assume that we know how 
each candidate might have scored if he had answered all of the questions. 

The first table shows the difficulty of each question for each candidate, 
and also his marks if he chose the four best questions. 


                              Questions               Marks if
                        1   2   3   4   5   6   7     chose best

Average Difficulty      8   7   6   5   4   3   3

Marks of Candidate A   10   9   8   7   6   5   5        34
                   B    9   8   7   6   5   4   4        30
                   C    8   7   6   5   4   3   3        26
                   D    7   6   5   4   3   2   2        22
                   E    6   5   4   3   2   1   1        18
                   F    5   4   3   2   1   0   0        14


Notice that, on each question in this artificial example, the candidates
are ranked in order of ability. The reliability of this examination, in this
form, is obviously perfect. The examiner has made no assumptions about
the difficulties of the questions, when he requires all candidates to answer
all questions.
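The claim of perfect reliability can be checked with the analysis-of-variance formula for the no-choice case. The sketch below assumes the usual two-way computational form, reliability = 1 - error / ((k - 1) x persons), with the error and persons terms built from the sums defined earlier in this appendix; the function name is hypothetical, and the data are the six candidates' marks on all seven questions from the table above.

```python
def anova_reliability(table):
    """Reliability when every examinee answers every question.

    table[i][j] is the mark of examinee i on question j.
    """
    r, k = len(table), len(table[0])
    sum_x = sum(sum(row) for row in table)                 # sum of all marks
    sum_x2 = sum(x * x for row in table for x in row)      # sum of squared marks
    row_sq = sum(sum(row) ** 2 for row in table)           # squared row sums
    col_sq = sum(sum(col) ** 2 for col in zip(*table))     # squared column sums
    error = k * r * sum_x2 + sum_x ** 2 - r * row_sq - k * col_sq
    persons = (k - 1) * (r * row_sq - sum_x ** 2)
    return 1 - error / persons


# Candidates A to F on questions 1 to 7 (all questions answered).
no_choice = [
    [10, 9, 8, 7, 6, 5, 5],
    [9, 8, 7, 6, 5, 4, 4],
    [8, 7, 6, 5, 4, 3, 3],
    [7, 6, 5, 4, 3, 2, 2],
    [6, 5, 4, 3, 2, 1, 1],
    [5, 4, 3, 2, 1, 0, 0],
]
print(anova_reliability(no_choice))
```

Because every candidate keeps the same rank and the same spacing on every question, the error term is exactly zero and the reliability comes out as 1.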

But when he allows a choice, the examiner apparently does so on the
assumption that the questions are interchangeable, that is, they are of
equal average difficulty, and that each candidate will choose the questions
on which he thinks he can do best. But we know that examiners cannot
judge difficulty accurately, and that candidates cannot always guess
how examiners will mark a particular question. So, when a choice is
allowed, the situation is likely to be like the next table. And this table is used
to illustrate the calculation of reliability. Each candidate is allowed to
answer “any 4 out of the 7 questions”. Notice that the ranks of the
candidates are changed because of their choices. (Gayen's studies as well as
Harper's have shown that this does happen.)


Questions                   1    2    3    4    5    6    7
Sum of marks awarded       12   23   25   23   17    7    8

Candidate     Marks (ΣX_i)     (ΣX_i)²
    A              23             529
    B              26             676
    C              18             324
    D              20             400
    E              14             196
    F              14             196
                  ---            ----
                  115            2321
                  ΣΣX          Σ(ΣX_i)²

Square each mark and sum the squares. The sum of the column sums equals
the sum of the marks. This provides a check, both here and with the total
of the frequency distribution below.

Prepare a frequency distribution of the marks awarded to each 
question 


Mark (X)      f      fX     fX²
    8         1       8      64
    7         3      21     147
    6         4      24     144
    5         7      35     175
    4         3      12      48
    3         3       9      27
    2         3       6      12
             --     ---     ---
             24     115     617
             Σf     ΣfX    ΣfX²




$$ r = \frac{(4)(6)(617) - 6(2321)}{(4-1)\left[(6)(2321) - (115)^{2}\right]} = \frac{14{,}808 - 13{,}926}{3\,(13{,}926 - 13{,}225)} = \frac{882}{2103} = .419 $$
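The arithmetic of this worked example can be re-checked with a short script. This is only a minimal sketch under stated assumptions: the function name `content_reliability` and its arguments are invented for illustration, and the formula coded here is the one implied by the worked figures above (k = 4 questions answered, N = 6 candidates), not a formula taken from any other source.

```python
def content_reliability(k, totals, sum_sq_question_marks):
    """Reliability for an examination in which each candidate answers k
    equally weighted questions (formula inferred from the worked example).

    totals: total mark of each candidate
    sum_sq_question_marks: sum of the squares of every individual question mark
    """
    n = len(totals)                              # number of candidates
    grand_total = sum(totals)                    # sum of all marks
    sum_sq_totals = sum(t * t for t in totals)   # sum of squared totals
    numerator = k * n * sum_sq_question_marks - n * sum_sq_totals
    denominator = (k - 1) * (n * sum_sq_totals - grand_total ** 2)
    return numerator / denominator


# Frequency distribution of the question marks: mark -> frequency
freq = {8: 1, 7: 3, 6: 4, 5: 7, 4: 3, 3: 3, 2: 3}
sum_sq = sum(f * x * x for x, f in freq.items())

# Totals of candidates A to F when each answers any 4 of the 7 questions
totals = [23, 26, 18, 20, 14, 14]

r = content_reliability(4, totals, sum_sq)
print(round(r, 3))
```

The frequency distribution is used only to obtain the sum of squared question marks; the same figure could be found by squaring the 24 marks one at a time.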
A Scientific Substitute for Grace Marks


The present grace marks rules of the Board are admittedly rather 
arbitrary, and I have been asked by the Secretary of the Board whether 
they can be placed on a more scientific basis. 

Yes, I believe they can. 

What follows is based on a paper by Dr. H. J. Taylor entitled “Grace 
Marks” and published by the U.G.C. in a booklet called Three Studies in 
Examination Technique. (Those who wish a fuller explanation than is 
given below, may refer to the original paper.) Dr. Taylor is an eminent 
British scientist (physicist) and educator, who has spent most of his life 
in service in India. He published this paper while he was Vice-Chancellor 
of Gauhati University. 

No measurement is perfectly accurate. There is a “margin of error”
in every kind of measurement. If we ask ten men to measure the length of
a table to the nearest millimeter, we will probably receive ten different
answers. Even so important a “physical constant” as the speed of light is
known only within an accuracy (called “probable error”) of plus or minus
100 meters per second.

The purpose of an examination is to measure the degree of knowledge, 
understanding, etc. possessed by the student. Like all measurements, an 
examination mark is subject to error. This is no reflection on the integrity 
of the examiners. It is the fundamental nature of measurement. As every 
scientist knows, measurement without error is impossible. 

Actually, there are two major sources of error in examination marks: 
(a) if the candidate had been asked different questions, his marks might 
have been different; and (b) if his answer book had been marked by a 
different examiner, his marks might have been different. 

The mark which the candidate would have obtained if there were no
measurement error in examination marks may be called his “correct mark”.
(This mark is, of course, unknown and, unfortunately, unknowable.)

It seems reasonable to state that: 

1. If a candidate's “correct mark” is probably below 33% in a 
subject, then he should fail in that subject. 
What does this mean? It means that, in a single subject, 1/3 of the
candidates whose “correct mark” is 33 will be awarded marks which are
either 28 marks or lower, or are 38 marks or higher. (See Supplement A
for details.) To put it differently, 1/6 of all the candidates with 28 marks
or lower should have passed, because their “correct mark” is 33%. Of
course, it is also true that 5/6 of these candidates have “correct marks”
of 32% or lower, and therefore are correctly failed.

Unfortunately, we cannot know which candidate has been failed by
“measurement error” and which has been properly failed. This is because
“measurement errors” are randomly distributed, i.e. according to the
laws of chance.

3. “Measurement error” in an examination

Fortunately, when the candidate answers several papers, these errors
tend somewhat to cancel each other out. Thus the chance of a candidate
being failed “by error” in two subjects is far less than the chance of such
failure in one subject. As every mathematician knows, in such cases we
multiply the probabilities. (Non-mathematicians may see Supplement B
for a demonstration.) Thus if the probability of a mark of 28 in one
subject is 1/6, the probability of marks of 28 in two subjects is only 1/36
(1/6 × 1/6 = 1/36), and the probability for marks of 28 in all six subjects, for
a candidate whose correct mark is actually 33% in each of them, is only
1/46,656.
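The multiplication described here can be sketched in a few lines. This is an illustration only: the helper name `chance_in_all` is invented, and the sketch assumes, as the text does, that the errors in different subjects are independent of one another.

```python
from fractions import Fraction

# Chance of a mark of 28 in one subject for a candidate whose
# "correct mark" is 33 (the figure of 1/6 is taken from the text).
p_one = Fraction(1, 6)


def chance_in_all(p, subjects):
    """Chance that the same independent error occurs in every subject."""
    return p ** subjects


print(chance_in_all(p_one, 2))  # 1/36
print(chance_in_all(p_one, 6))  # 1/46656
```

Exact fractions are used so that the results can be compared directly with the figures quoted in the text.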

Obviously, we should have no hesitation in failing a candidate if
there is only this microscopic chance, one in forty-six thousand, that
he should have passed. However, if there is one chance in six that we are
doing an injustice in failing him, then he seems to be a candidate for
“grace marks”.

4. Setting the standard

But between these extremes, where shall we draw the line? At what
level of “probability that he should have passed” do we pass a candidate?
At what level of probability that his “correct mark” is 33% do we say,
“This probability is so small that we must ignore it, and fail him”?

Dr. Taylor has proposed an eminently logical solution. If a candidate
has exactly 33% marks in each of his six subjects, we pass him without
question. Calculate the probability that such a candidate’s “correct mark”
is actually 33% or above in every subject. This probability is then the
minimum level of probability required for passing the examination.

If the candidate obtains a mark of 33% in a particular subject, then
we find from statistical tables that the chances are 53.98 out of 100 that his
“correct mark” is 33% or better. (The chances are a little better than 50%
because a “correct mark” of 32.5 would be considered a 33.) Thus the
The likelihood of a good candidate being failed by “measurement
error” is much greater than the likelihood of a poor candidate being passed 
by “measurement error”. Therefore this system, which takes scientific 
account of “measurement errors” and allows for them, is likely to raise 
the pass percentage of the Board. And it is proper that it should do so, 
because the high failure rate was partly due to errors of measurement. 

7. How to calculate “passing probability”

First we will describe the detailed method for those who “want to
know what they are doing”. However, this detailed method is not
recommended; it is much easier to use Dr. Taylor's table. So those who are
not interested in the details can go right on to the next section.

(1) Subtract 33% from each of the candidate’s marks. Suppose, for
example, the candidate’s marks are 36, 33, 30, 34, 32, 36. We do it this way:

Marks obtained:    36    33    30    34    32    36
Subtract:          33    33    33    33    33    33
Deviations:        +3     0    -3    +1    -1    +3

(2) Find the “probability” for each of these differences. If you want
to calculate these yourself, you can calculate each (X - 32.5)/σ (remember
that σ = 5), and look up the p value in a table of the normal curve. Then
multiply the six p values by each other, to obtain the total probability.

Deviation       +3      0     -3     +1     -1     +3
Probability    .76  ×  .54 ×  .31 ×  .62 ×  .46 ×  .76

If you multiply these out, you will obtain a total probability of 
.027576 or 2.76/100. Since this is better than our standard of 2.47, the 
candidate has Passed. 
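For readers who would rather compute than consult a printed table of the normal curve, the detailed method can be sketched as follows. This is a hedged illustration, not Dr. Taylor's own procedure: the function names are invented, the normal curve is evaluated with the error function instead of a printed table, and the constants 32.5 and 5 are the lower limit of a 33 and the standard error quoted in the text.

```python
import math

SIGMA = 5.0          # standard error of marking used in the text
LOWER_LIMIT = 32.5   # lower limit of a mark of 33


def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def subject_probability(mark):
    """Probability attached to a single mark: p for (X - 32.5)/sigma."""
    return normal_cdf((mark - LOWER_LIMIT) / SIGMA)


def passing_probability(marks, decimals=None):
    """Product of the single-subject probabilities.

    If decimals is given, each p value is rounded first, as it would be
    when read from a two-place table.
    """
    product = 1.0
    for mark in marks:
        p = subject_probability(mark)
        if decimals is not None:
            p = round(p, decimals)
        product *= p
    return product


marks = [36, 33, 30, 34, 32, 36]
standard = passing_probability([33] * 6)    # candidate with 33 in every subject
by_table = passing_probability(marks, 2)    # p values rounded to two places
unrounded = passing_probability(marks)      # p values left unrounded

print(f"standard  {standard:.4f}")
print(f"by table  {by_table:.6f}")
print(f"unrounded {unrounded:.4f}")
print("passes" if by_table > standard else "fails")
```

With two-place p values the product reproduces the hand calculation above; the unrounded product is slightly smaller but still lies above the standard.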

NOTE: Under usual “grace marks” rules, this candidate would 
have failed, since he “failed” in two subjects and earned no grace marks. 
However, it has been shown that his failure is due to measurement error, 
not due to his own lack of ability. He is actually a better candidate (as 
is shown by his higher “passing probability”) than the candidate who 
receives exactly 33% marks in each subject. 

8. Simplified calculations using Dr. Taylor's Table

Dr. H. J. Taylor has provided the attached Table which eliminates
the need for multiplication. The multiplications have been done in the
Table. It therefore requires only a few seconds to find the “passing
probability” of any candidate.

(1) Subtract 33% from each of the candidate’s marks. Suppose, for
example, the candidate’s marks are 36, 33, 30, 34, 32, 36. We do it this way:

Marks obtained:    36    33    30    34    32    36
Subtract:          33    33    33    33    33    33
Deviations:        +3     0    -3    +1    -1    +3

(2) Find the first deviation in the top of the Table and read off the
number immediately below this. In the line below +3 you will find 76.
(This is the probability of a single deviation of +3.)

Next find this probability in the left-hand column of the Table.
Then go across this row until you reach the column of the second
deviation. In our example, we find the row numbered 76. Going across it to
the 0 column we find the number 41. (This number is the product of the
two probabilities: .76 × .54 = .41.)

Next find this new probability in the left-hand column of the Table.
Then go across this row until you reach the column of the third deviation.
In our example, we find the row numbered 41. Going across to column -3
we find the number 13. (This is the product of our three probabilities so
far: .76 × .54 × .31 = .13.)

We continue this process until we have multiplied by all the
deviations. Dr. Taylor suggests that we avoid the last few rows of the Table,
where numbers are small and therefore rounding errors are larger.
It is recommended that as soon as an entry less than 10 is reached, it
should be multiplied by 10 to carry the calculation back to the top of the
Table. All the steps in our example are as follows:

Row 100, column +3: 76
Row 76, column 0: 41
Row 41, column -3: 13
Row 13, column +1: 8
(Multiply 8 × 10 = 80)
Row 80, column -1: 37
Row 37, column +3: 28

We divide our final answer by 10 (to correct for previously multiplying
by 10) and find our answer to be 2.8/100.

(3) If this final value is greater than the minimum passing
probability of 2.47, the candidate passes. If the final value is less than 2.47,
the candidate fails. (This candidate passes. His failures in two subjects
were due to errors of measurement.)
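The table look-up described in this section can be imitated in code. The sketch below is an assumption-laden illustration, not a copy of Dr. Taylor's Table: it assumes that each entry is simply the row number multiplied by the column probability and rounded to the nearest whole number, which agrees with every step of the worked example. The names `COLUMN`, `table_entry` and `passing_probability_by_table` are invented for the purpose.

```python
# Single-deviation probabilities (in hundredths) quoted in the example.
COLUMN = {+3: 76, 0: 54, -3: 31, +1: 62, -1: 46}


def table_entry(row, deviation):
    """Assumed Table entry: row x column probability, rounded to a whole number."""
    return int(row * COLUMN[deviation] / 100 + 0.5)


def passing_probability_by_table(deviations):
    """Follow the look-up procedure; the result is in hundredths."""
    row = 100      # start in the top row of the Table
    scale = 1      # records every multiplication by 10
    for deviation in deviations:
        row = table_entry(row, deviation)
        if row < 10:
            # Carry the calculation back to the top of the Table.
            row *= 10
            scale *= 10
    return row / scale


result = passing_probability_by_table([+3, 0, -3, +1, -1, +3])
print(result)                                    # 2.8
print("passes" if result > 2.47 else "fails")    # passes
```

Only the five deviations that occur in the example are included; a fuller sketch would need one column for every deviation that appears in the printed Table.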


9. Notes

1. With a little practice these calculations can be done in a few
seconds using this Table. A sliding scale can be designed to help locate
the proper deviation in any given row. Subtraction of 33 from each mark
can be done in the head, without writing it down.

2. The descriptions above are based on an examination with 6 
subjects with 100 marks maximum for each. Where the number of papers 
or the number of marks are different, the necessary tables and method 
can easily be worked out. 

SUPPORTING MATERIALS IN SUPPLEMENTS 
A. Distributions of error. 

B. Why probabilities are multiplied.

Supplement A — Distributions of error 

Suppose the “correct mark” of a candidate is actually 33%. Suppose
his examination answer book is marked by, say, 100 experienced
examiners. What sort of marks will he receive?

Some examiners will award exactly 33 — but most will not. 

Many will award marks of 34, or 35, or 36. 

Many will award marks of 32, 31, or 30. 

Fewer will award marks above 38; but this same number will award 
marks below 28. 

The marks will be distributed “at random” above and below the 
“correct mark”. 

Actually, if we take our “standard error of marking” as 5 marks,
and our “correct mark” as 32.5 (which is the lower limit for 33%), the
distribution will be as follows. On the average:

  1 examiner  will award    46    marks
  2 examiners will award  43-45   marks
 15     „       „         38-42     „
 28     „       „         34-37     „
  8     „       „           33      „
 28     „       „         29-32     „
 15     „       „         24-28     „
  2     „       „         21-23     „
  1 examiner  will award    20    marks
---
100 examiners

Notice that 46 of these 100 examiners have awarded “failing” marks
to this “passing” candidate. But 54% of the examiners have actually
passed him. Thus the chances are slightly better (54:46) that a candidate
whose “correct mark” is 33% will be awarded a passing than a failing
mark. However, we cannot easily afford to ignore the fact that 46% of the
examiners failed a candidate who should have passed.

Notice that we could re-write this table on the basis of a “correct
mark” of 32 (or, more exactly, 31.5), i.e. a candidate who should fail. It
will look like this:


  1 awards     45    marks
  2 award    42-44   marks
 15   „      37-41     „
 28   „      33-36     „
  8   „        32      „
 28   „      28-31     „
 15   „      23-27     „
  2   „      20-22     „
  1 awards     19    marks


This candidate, whose “correct mark” should fail him, is failed by
54 examiners, but is passed erroneously by 46 examiners.

Dr. Taylor’s “passing probability” method takes into account both
the probability that a “pass” mark is awarded to a “fail” candidate,
and the probability that a “fail” mark is awarded, due to chance
errors of measurement, to a “pass” candidate.

If these results look extreme, consider the following table. This
is based on actual data. Copies of the same Higher Secondary School
History examination answer book were sent to 90 experienced examiners.
The marks out of 50 awarded by these examiners were as follows:


  2 examiners awarded  23-26  marks
  9     „        „     20-22    „
 28     „        „     17-19    „
 22     „        „     14-16    „
 18     „        „     11-13    „
  9     „        „      8-10    „
  2     „        „      6-7     „


The average mark awarded by these 90 experienced examiners was
15.4. The candidate's “correct mark” is clearly a failure, yet 39 of these
experienced examiners passed him, two of them in the Second Division.
If you will consult Chapter III, where these data are published, you will
find that some answer books received an even wider range of marks.


Supplement B — Why probabilities are multiplied 

Why do we multiply the probabilities by each other (rather than
using some simpler process like adding)? The following demonstration
will show how this works.

Suppose we have a coin. It has two sides, named Heads and Tails.
Let “H” mean Heads, and “T” mean Tails.

If we toss the coin, there are two ways in which it can land on the
table: H or T. So the chance of Heads is 1/2. If we toss two coins, they
can land in any of the following ways:

(1) H H

(2) H T

(3) T H

(4) T T

We saw that the chance for one coin coming up Heads was 1/2.
Count the chances for two coins to both come up Heads. Only
one of the four patterns shows two Heads, so the chance of two Heads
is 1/4. Notice that 1/4 is 1/2 × 1/2.

Let us toss three coins. There are now eight ways in which they
might arrange themselves. We would predict that
the probability of three Heads is 1/2 × 1/2 × 1/2 = 1/8 (1/2 multiplied
by itself three times). Is this correct?


(1) H H H 

(2) H H T 

(3) H T H 

(4) H T T 

(5) T H H 

(6) T H T 

(7) T T H 

(8) T T T 
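The counting can also be left to a short program. This is a minimal sketch of the same demonstration, assuming fair coins that fall independently; the function name `chance_all_heads` is invented for illustration.

```python
from fractions import Fraction
from itertools import product


def chance_all_heads(n_coins):
    """List every pattern of n coins and count those that are all Heads."""
    outcomes = list(product("HT", repeat=n_coins))
    favourable = [o for o in outcomes if all(side == "H" for side in o)]
    return Fraction(len(favourable), len(outcomes))


for n in (1, 2, 3):
    # Compare the counted chance with 1/2 multiplied by itself n times.
    print(n, chance_all_heads(n), Fraction(1, 2) ** n)
```

For three coins the program lists the same eight patterns as above, of which only one is H H H, so counting and multiplying agree.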
The Problem of the Crossed-out
Tick Marks


It is in the nature of scientific research that it is an exploration into
the unknown. And it is in the nature of the unknown that the problems
to be encountered in its exploration cannot always be clearly foreseen.
One such problem which was not adequately foreseen was the possible
effect of the first examiner on the second. It was not foreseen because
the planners did not realize that so many lines, ticks, and crosses had
been placed in the answer books. When this was discovered, about two
months of extra labour went into trying to disguise them.

Was the disguising effective?

We have seen that, in comparison with studies in other countries,
our examiners seem to have somewhat higher examiner reliability than the
average. We have discussed the several reasons why Indian examiners
might be expected to have higher examiner reliability than, say, American
examiners. However, the nagging doubt remains: Was the disguising of
previous marks effective? Or did the previous marks spuriously raise
examiner reliability?

Four lines of evidence have been gathered to explore this question.

1. A comparison with other studies done in India should have been
useful. Unfortunately, few complete parallels were found. The subjects
involved differed, the experimental technique differed, etc. The most
nearly parallel is the recently completed study of the junior author
(Misra 1970), in which his means are based on two completely
independent markings, as against our ten of questionable independence. His
examiner reliability for history had a mean of .71, as against our median
of .74 for “Other” re-examining. His mathematics mean was .81 as against
ours of .95. Thus if there is any influence of the previous marks, it is not
in the paper which is considered to be most subjectively marked. Taking
these facts along with comparisons with other studies, a conservative
statement is this: Our examiner reliabilities may be slightly higher than
for somewhat similar studies done in India. But no more positive
statement than this can be made.

2. We asked the examiners. A questionnaire was sent out on several
aspects of the experiment, and 36 of the 40 examiners sent in their replies.
Half of them claimed to recognize the answer books that they had
previously marked, but when asked how many of the 100 these were, most
of their answers were so far away from 50 (the correct number) as to
suggest that they did not really recognize their own previous work.
When asked, “Do you think that your re-marking was influenced by the
previous marks in the answer books?” none marked “Yes, very much”, 3
marked “Somewhat influenced”, and 33 marked “Little or no influence”.

3. At our suggestion, the examiners used green ink in marking the
second time. We inspected quite a few answer books, checking visually
for the relationship between the green and red marks. (Remember there
were red crosses added where there had been ticks, ticks added where
there had been crosses, and extra ticks, crosses, and underlinings added
at random.) Most of the green x’s and ticks were, of course, near the old
red x’s and ticks. But some green marks were found where there were no
red marks. Most important, there were quite a few places where there
were red marks but no green marks. It still could be that the existence of
a red mark (even though disguised, so the examiner could not tell if the
original mark was favourable or unfavourable) called the examiner’s
attention to the possible existence of a point to be marked. However,
he certainly used his own judgement as to whether or not to mark it.

4. In the most laborious process of all, we selected a sample of four
sets of 50 answer books each for each subject. They represented the
“Same” re-examiner with the highest examiner reliability, the “Same”
with the lowest examiner reliability, the “Different” re-examiner with the
highest examiner reliability, and the “Different” with the lowest examiner
reliability. On the fifteen to twenty-five pages of each of these 800 answer
books, we counted every red X and underline mark. (The number of ticks
and of X’s was identical, except for the earlier ones where we had tried
to disguise by crossing out the tick marks.) We hoped that this would
give us correlation figures that we could use, with partial correlation
techniques, to correct reader reliabilities for the influence of the previous
examiners.

Unfortunately, the results of this study were ambiguous. Had we
foreseen the problem, and counted the marks before disguising and then
re-counted them after, we could have obtained more clearly interpretable
answers.

5. A similar study was done for the Ninety Marking Ten data.

Our overall conclusion is just this: There may have been some
effect from the fact that the original markings, even though disguised,





could not be removed from the examination books. Ideally, of course,
completely independent marking would require that the first examiner
leave no marks in the answer book. This factor may have raised
examiner reliabilities by as much as, say, five correlation points (.05).
But we cannot say positively that it did, and anyway the difference
is negligible. Even if the examiner reliabilities were raised by the
fact that the examiners were not “completely independent”, they are
still so low that no serious reliance can be put on the resulting marks
for differentiating accurately among individual candidates.
The following information, not included in this book, is available
in computer print-out form. If any scholar wishes to do further
calculations, a copy may be obtained on loan from: Director, Bureau of
Educational Research, Ewing Christian College, Allahabad, U.P.,
211003, India.

Sums, sums of squares, and sums of cross-products for the three
variables correlated: marks of first examining, marks of second
examining, and number of pages in answer book, separately for each batch
of 50.

Standard errors of the means, and standard errors of the differences
for uncorrelated marks, for each batch of 50.

Standard errors of the standard deviations, and standard errors of the
differences, and their significance for uncorrelated standard
deviations, for each batch of 50.

Standard deviations of the differences in marks between pairs of
examiners, for each batch of 50.

Also hand calculations for the two formulas for the significance of the
difference between correlated standard deviations.

Anyone reading between the lines of this book will realize that certain
other calculations are also available. Since we do not have
duplicates, we do not list them here. If you desire anything special, for
research purposes, however, please write.
self-consistency of examiners in, 218-
220; self vs other disagreement of
examiners in, 159-160, 163-164, 171,
217-218; six equal batches in, compared,
T32, 32-35, F33, 171-175, T174,
209, T210, 217-219, T218; standard
deviation of, T39, 41, 160-162, T170,
210-213, T214, 218-219; standard error
of measurement of, 42-48, F45, F46,
170-171, T170, T214, 214, T288;
standard of marking in, 41, 161, 211,
219; styles of marking in, 164-165, 213,
218; summary of statistics on, 4, 6-7,
T170, 209-216, T214

Bose, P. K., 160 

Brief summary: of Ninety Marking Ten, 
4-6, 68, 74 ; of Four Thousand Re- 
examined, 4, 6-7, 68-69, 71, 73-74 

Brinkmanship, educational, 40; in biology,
209; in Hindi, 194 n; logic for
avoiding, 194 n; in mathematics, 224;
naivete of, in marking, 194 n


Cambridge F.R.S and D.Sc., 61 
Card punching, 104-105 
Carelessness: in original marking, 235;
in second marking, 198
Central agency, need for, 94
Chance: cause of high failure rate, 13, 71;
and pass-fail, 15-16, 70-71
Chance vs merit, 5, 7, 13, 15-16, 47; in
biology, 213; controlled by scaling, 13,
16, 134, 243, 247, 258; in Hindi, 199;
in history, 12, 130, 134, 139, 183, 186;
in mathematics, 230, 241
Change, single easiest, 134, 242, 244, 259 
Chemistry examinations, 67, 770, 71-72, 77 
Chinese examinations, 54, 57 
Choice of optional questions, 93 ; deter- 
mines marks, 71, 84-85, 190-192, 220- 
221, 240-241 ; effect on reliability, 7, 8, 
47-48, 83, 167, 182-183. 222, 230 , effect 
on validity, 63-65, 85, 192, 240-241 ; 
guessing game, 192, 220, 241 ; reduction 
of effect of, 88 ; unjustified, 83 85, 241 
Civics examinations, 770, 75 
Class ; m Tripos, 61 See also Division 
Classes. See Divisions 
Code used, 103-104 
Coffman, W, 115 
Commerce examinations, 66 
Common batch : scaling method. 249 ; 

and standard deviation scaling, 251 
Comparison of both history - studies, 47-48, 
167, 182 


Computer, use of, 105-106 
Consistency, measures of : absolute, 80-81 * 
116,7/26 , relative, 80,7/26 
Constant, adding or subtracting, as a 
scaling method, 252-253, 255 
Content reliability Sec Reliability, content 
Content, standard error of See Standard 
error of content defined 
Contrast effect, 154 

Comcrsion of marks, example of, 251-252 
Conversion of rank orders to marks : 

example of, 256-257 , methods for, 256 
Conviction of infallibility in marking 
examinations, 2-3, 70 
Correct mark. See Mark, correct 
Correlation . between pairs of examiners 
(see Reliability, examiner) ; of examiners 
with criterion, 74, 150 ; of subjects at 
successive stages, 24, 78-79 
Correlation coefficient, inadequacy of, 
16-17, 80, 165, FI69, 169-I70, examples, 
17. 148,182, 198,213,229 
Creativity, unreliably measured, 62, 64 
Crosses and ticks. 67-68, 100-102, 109-110, 
effectiveness of crossing out, 102, 110, 
285-287 

Cumulative frequency curves, F34, J5-37 
Cumulative grade point average, predic- 
ting with objective vs essay tests, 64 
Cumulative records, recommended, 91 
Cureton, E E., ISO , 

Curriculum, improvement of, 81-82 


Data: available for analysis. J03, 230, 
289, reliability of, 99, 106-107, 111-112 
Dalta.S , 76, 132 n 

D'Cosra, A , 151 , 

Definitions of reliability concepts, 113-127, 
7/26 See also each concept 
DEPSE, 65, 92, 123 

Differences, absolute, 158-159, 7/70, mean 
of ,7770 

Differences in marks between examiners, 
67-70, all four subjects compared, 19-20, 
72/ , 158-161, 7 170, 112-115, 7288, 
average, 6, 19 20, 22, in biology, 72/4, 
214-216: examples of “agreement”, 185, 
201, 215, 233, in Hindi. 7200. 200-202, 
in history, 8-10, T9, 14, 17, T184, 184- 
188. largest, 4-6, T9, 72/, (see also 
largest differences) , in mathematics. 229, 
7232, 232-234, measurement of, 248-24 9 
See also Division; Divisions; Marks, 
mean; Pass Percentages; Standard
Deviations (SD)


Difficulty index, 72 

Difficulty of questions . paper setters cannot 
ludjte, 7, 82, 87. students cannot judge, 

82. 192. 221. 240-241 See also Item 
analysts 


analysts . _o 

Disagreement on best and poorest, 5, T9, 
13-15, m 68 


13-15, 774. oe 

“asKBSSrtaSstsr 

D “ l d c 'S' tfSST»3!£»r«. f"- 
s?iS4£a v, Thrf *» •« 

of question options, 240-241 

Distnbu.tons of mark. J« M.rto dot"- 
button of 


English examinations, 37, 60-65, 67*76, 
770 79, 81, 243. more reliable than 

nathmaln. “T'.Tm 
history, 128, 136. 141-142, 150-151 

Equal ability awarded different marks 
m all four subjects compared, 6, 19 22, 

in mathematics, 240-241 
rVin-itme methods and rank order selling. 
^255 See also Randomization and seal- 


Equi percentile scaling, 250 See also 
Rank order scaling 

Equivalent answer books, comparison or 
examiners on See Examiner X 


butionot r 

Division Korn®* «{ ffi&fo.i f. 

history. 12. 'iff gs untrustworthy. 
llSTiS 1 28,Ssb See also Divisions, 

Divisions. d ’*' r ‘omMred ViK.m biology, 

iSl’TiJJjW-lS* TW ' 186> 

matics, 227-223,7232 „ fouf 

Divisions, scatter cham^ 27 17). 
subjects “ m P“ r ® d, 77/tf, 216-218. 72/5. 

7772 in bl ®’ 203, 77£W. »" 
in Hindi. T202, 77*5. m 

history, 7^/86. 733,5 72J* 

mathematics, 234-231. r-r 

D Sc ,61 
class correlation, 122


class corre»uon. 

Econonuc 

Education examinations. . 85 , 

Education Commission. 37. • 

92, 271 c ee Bnok- 

Educational bnnknumshr 
mansh.p. educational 
Einstein, A , 47 n 
different marks, example of, 17; differences
between (see Differences in marks
between examiners; also statistic sought);
instructions to, 101, 104, 111; number
needed, 262-264, T263, T264; original,
in history, 149-151, T151; reliability of,
when each marks a different question,
260, 264-267, T266; remuneration of,
104, 111; sample studies of, 136 n, 146,
T147, 150 n, 151-154; selection of,
97, 111; training of, 61, 64, 123, 269;
who agree likely to be wrong, 260-262,
264

Examiners, Indian as good as others, 1, 75, 
151, 165, 246, reasons, 2 

Examiners, self-consistency of in biology* 
218-220, in Hindi, 20-1-205. in history* 
188-189, in mathematics, 237-239 

Examiner X (marking of six equal sets in 
each subject) all four subjects com- 
pared, 732 . 32-35, F33, 172-175, 7/74 ; 
biology, 209, 72/0, 72/4, 217-219, 72/8 , 
differences, 732, 733, 7/74 . experi- 
mental design, 171-172, Hindi, 194,7/94, 
7200, 203-205 , 7204 . history, 178, 
7/78, 7/84, 187-188, 7/88, 189 . largest 
differences, 732, 172-173, 7/74; match- 
mg of sets, 103, mathematics, 224, 
7224, 230, 7732, 235-236, 238, T238, 
and number of pages, 176-177 ; selection 
of. 98 

Experimental design, 4, 8, 19, 96-112, 
answer books, preparation of, 100-102, 
109-110 , answer books, selection of, 8, 
99-100, 109 , card punching, 104-105 , 
coding, I02-IO4 , computation, 10S- 
106 ; design, 98, 1 10-1 1 1 , despatch and 
receipt, 104, ,111 , dividing, 102 104 , 
examinations selected, 96-97, 108 109 , 
examiners selected, 97, 111 , examiner X, 
98,171 , formulas used, 106-107 (see 
also Formulas for), reliability of data, 
99, 1 06-107, 111-112 , reproduction, 108 


Factor analysis * or answer books, 151- 
152 , of examiners, 152-154 , of marks, 
66,73 

Fad See Pass-fail line , Pass vs fail 
Fail and First to same students m history, 
Til , physics, 67 , mathematics, 67 
Failure See Pass fail line , Pass vs fail 
Failure rate, 66 , caused by chance, 13, 
15, 70 , effect of randomization and 
seating on, 13, 134 See also Pass vs fail , 
Pass Percentages 
Findley, W , 113, 269 
First Cass, and F R S , 61 


First Division : compared with Third, 5-6, 
Til, 20, 25, 726-27, 28-31, 729-30; in 
history failed. Til ; m physics failed. 
67 ; in mathematics failed, 67 ; and 
failure only reliable difference, 31. See 
also Division, changes of, on re-marking, 
Divisions, scatter charts of 
Formulas for ; average rank order corre- 
lation with criterion, 150 , average rank 
order Intcrcorrclation, 145 ; content 
reliability, 122 ( see also Harper's content 
reliability formula), difference between 
correlated standard deviations, 106- 
107 , marks reliability, 124 ; reliability 
concepts, 7/26 , scales, relationship of, 
114; standard deviation, Jenkins' esti- 
mate, 103n , standard deviation of 
differences, 148n , standard deviation 
of difference scores, 125 ; standard 
deviation scaling, 1(4, 251 ; standard 
deviation of a sum, 100, standard error 
of marking (unsealed), 1 25, 7/26 ; l- test, 
106-107, 155-156 

Four Thousand Re-examined ; brief sum- 
mary, 4, 6-7 ; detailed reports, 155-241 ; 
experimental design, 96-107, highlights, 
19-50 ; summary tables, 7/70, 7/72, 
7/74 

Frequency distributions Sec Marks, distri- 
bution of 
F R S , 61 


Gauhati University See Taylor, HJ ; 
Misra, V S 

Gayen, AK.2, 770, 79, 81-82, 85-90, 99, 
167, 173 n , on choice of options, 85, 
190, 220, 240 ; on mathematics, 230 ; 
on reliability, 71-74, 76 

Geography examinations, 65, 770, 73, 
75-76, 83, 93 

Geometry examinations See Mathematics 
examinations 


Grace marks, 7, 40, 49, 88, 93, 134 ; effect 
of Taylor's substitute, 279 , percentage 
receiving, 99 ; Taylor’s substitute recom- 
mended, 270 , Taylor's substitute ex- 
plained, 276-284 

Grading * categories for reliable marking, 
7, on the cune, 160, 242, 244, recom- 
mended, 87 

Guess, students forced to, 82 

Guessing game, effect on marks, 192, 221, 
240 


Guilford, JP, 55, 76-17, 122. 136, 145. 
148, 172 n ’ 


Gulhksen, H on reliability, 55. 57, 113. 
115-116, 122, 124, 145 n, 165-166, 168 ■ 
on scaling 46, 173-174 n, 246, 249 

* 4I i a, « ” 5 ffc< ?i and content reliability, m 

defined, 123, encouraged by Indian 

marking methods, 166 
Harper, A E, Jr , 770, 83, S5-86 93 JI5 
on choice of optional questions, 7l! 
I zJ^lS*!*** « o^ttvc examina- 
tions, 59 79, on different marking 

systems, 37, 41, 43, 160-161; on personal
equation, calculation of, 239, 239; on
scaling, 49-50, 87-88, 161, 173-174 n, 242,
246, 249, 258; on standard error of
measurement, 80, 170
Harper's content reliability formula, 123,
166, 170, 273-275; effect of choice of


46 i s - ^ J«. 7370, m. 
T200, T288; scaling of, 195, 200, 205;
"second" examiner in, 196; self-consistency
of examiners in, 204-205; self vs.
other disagreement of examiners in,
159-160, 163-164, 171, 203; six
equal batches in, compared, T32, 32-35,
F33, 171-175, T174, 194, T194, T200,
203-205, T204; standard deviation of,
T39, 41, 160-162, T170, 195-198, T200,
204-205; standard error of measurement
of, 42-48, F45, F46, 170-171, T170, T200,
200, T288; standard of marking in, 41,
161, 197, 205; styles of marking in,
164-165, 198-199, 203; summary of
statistics on, 4, 6-7, T170, 195-202, T200


options on, 199, 274-275; effect of
History examinations, 4, 8-50 passim, 18,


differences in parts vs. in questions on,
206, 241; effect of marking styles on,
183, 198-199, 213, 230; empirical check
on, 124; example of absurd result from,
182

Head Examiner: better procedure, 86,
247-249; defines mean and standard
deviation, 252; role of, 246-247
Highest mark in sample underestimated,
173

Highlights. See Summary, highlights
Hill, W. H., 59, 63, 81-85, 88, 90
Hindi examinations, 6-7, 19-50, 68-69,
71-74, 76, 160, 193-207; chance vs.
merit in, 199; compared with other
examinations, 23, 35, 41, T70, 157-177,
195, 203, 213-214, 226, 228, 229 n, 230,
233-234; different marks awarded equal
ability in, 19-22, 32-35; difficulty of
questions, variation in, 206, T206;
distribution, actual vs. intended in,
161-162; Division in, changed on
re-marking, 24-32, 179; Divisions,
distributions of, T39, 197, T200; Divisions,
scatter charts of, 24-28, T26, 171,
T172, T202, 202-203, T204; educational
brinkmanship in, 194 n; Fail and Second
to same students in, T26; item analysis
of, 205-207, T206; largest differences
between examiners in, T21, T32, 35,
T174, 200-201; marks in, differences
between, 19-20, T21, 158-160, T200,
200-202; marks in, distributions of, F34,
35-42, F36, T38-39, 193-195, T194;
means of, T39, 155-160, T170, 195-198,
T200, 204-205; misclassification of

students in, 20-22, 200; paper, description
of, 193, 205; Pass marks in, 41,
194 n; Pass Percentages in, 6, T23,
23-24, 197, T200; reliability, content,
of, T70, 166-167, T170, 198-199, T200,
T288; reliability, examiner, of, T70,
164-165, T170, 196, 198, T200, T202,
202-203, T204, T285; reliability, marks


37, 65, 68-69, 71, 73-76, 128-154, 168-169,
178-192; both studies compared, 47-48,
167, 182; chance vs. merit in, 12, 130,
134, 139, 183, 186; choice of options in,
determines marks, 190-192; compared
with English, 128, 135-136, 142, 150-151
(see also Taylor, H. J., on multiple
marking study compared with Ninety
Marking Ten); compared with other
examinations, 23, 35, 41, T70, 157-177,
213, 226, 228, 229 n, 230, 233-234, 236;
compared with other studies, 68, T70,
108, 110, 113, 116, 128-129, 135-136,
140-142, 146 n, 150-151, 154, 168;
different marks awarded equal ability
in, 4-6, T9, 10, 19-22, 32-33, 130, 190-192;
difficulty of questions, variation in,
189-192, T191; distribution, actual vs.
intended in, 161-162; Division in,
changed on re-marking, 24-32, 179;
Divisions, distributions of, T11, T39,
132, T133, 180-182, T184, 186; Divisions,
scatter charts of, 24-28, T25, 171,
T172, T186, 186-189, T188; Fail and
First to same student in, T11; Fail and
Second to same students in, TX; inter-
examiner correlations in, T147; item
analysis of, 189-192, T191; largest
differences between examiners in, 4-8,
T9, T21, T32, 35, T174, 184-185; marks
in, differences between $ JO. 79. *7. 
19-20, 72/. 158-160. 7/54. 184-186. 
marks In distribution of. 9, FJS, 35-42. 
736. 738 39, 129 131. 7/30. 7/30. T173. 
178-179. 186. tS9. means of, 79. W. 739. 

69, T/JS. 135-141, 7/37, 7/47. M3, 
158-160, 7/70. 179-180. 182. 7/84. 188- 
189; misclassification of students in, 6,

10, 20-22, 130, 184; paper, description
of. 178. 190. P** £?*?£*?* in \.^ 
7/2. 12-13. 22 24. 773. 135-136. Ml. 
151-182. 77<V. rclubi’ ty. content, of. 
170. 73. MS- M9. 166-167. 7/70. J E- 
183, 7/84. 7! ?W. reUNL’iy. examiner, M, 
16-17. 26, 770. M5-I4S, 7/47, 164-1 65. 
7W. tw-IC. 7/34, Tm. 18MS3. 
T188, 285, T285; reliability, marks, of,

42, F44, 47-48, T70, 149, 168, T170, 183,
T184, T288; scaling of, 16-17, T138, 143,
179, 184, 189, 256-258, T257; "second"
examiner in, 180; self-consistency of
examiners in, 188-189; self vs. other
disagreement of examiners in, 159-160,
163-164, 171, 187-188; six equal

batches in, compared, T32, 32-35, F33,
171-175, T174, 178, T178, 187-188,
T188, 189; standard deviation of, T39,
41, 135-141, T138, T142, 143, 160-162,
T170, 179-182, T184, 188-189; standard
error of measurement of, 16-17, 42-48,
F45, F46, 141-144, T142, 170-171, T170,
183-184, T184; standard of marking in,
41, 161, 181, 189; styles of marking in,
164-165, 183, 186, 188-189; summary of
statistics on, 4-7, T170, 179-186, T184
Hoyt and Stunkard, 122 
Improvement of examinations. See Traditional
type examinations, improvement
of

Indian Examiners, 2, 7, 75, 151, 165, 246
Indian research, 65-95; conclusions of,
90-91

Infallibility, conviction of, in marking 
examinations, 2-3, 70 

Instructions to examiners, 8, 19, 244-246;
Head Examiner's, 247-248; improvement
needed, 269; need for specific, 160,
165, 189, 205, 220, 236; recommended,
86

Intelligence tests, recommended, 91, 94
Inter-correlation of different academic
levels, 78-79; of papers, 77
Internal assessment recommended, 89-91;
scaling of, 89

Intra-class correlation, 122-123, 148, 166,
273

Item analysis: in biology, 220-222, T222;
and experimental design, 105; in Hindi,
205-207, T206; in history, 189-192, T191;
in mathematics, 239-241, T240; method
used for, 190, 205, 220, 239; negative
validity, 192, 206, 221, 240; should be
done routinely for public examinations,
270


J-effect, 40-41, 86, 159-160; abolition of,
logic for, 194 n; in biology, 209-210;
and Division, 195; explanation of, 131;
in Hindi, 194-196; in history, 41, 131,
179-180; in mathematics, 224-226; and
need for scaling, 254; training examiners
to eliminate, 40-41. See also Reverse-J
Jenkins' formula for estimating standard
deviation, 103 n


Journalists, need for, 95, 270
Judgemental scaling, 238-239, 242, 244-246
Judgement in marking, 173; reduction
of, 173

Justification for study, 1-2 


Kelly, T. L., on reliability standards, 47 n,
77

Keynes, John Maynard, 47 n
Kuder-Richardson reliability, 122, 124


Languages, examinations in classical, T70.
See also under name of language

Languages, examinations in regional, 65,
T70, 75. See also Assamese; Bengali;
Hindi

Largest differences, 67-68, 70; in all four
subjects compared, T21, T32, 35, 173,
T174; in biology, 215; in Hindi, 200-201;
in history, 4-5, T9, 184-185; in
mathematics, 6, 232-233; significance of,
172-173; in six equal sets, T32;
underestimated, 173 n
Law examinations, 64
Liddle, S. K. V., 158

Link test, 249; and standard deviation
scaling, 251-252
Literature, review of, 53-95
Logic examinations, 69
Luck. See Chance; Chance vs. merit


Mahalanobis, P. C., 37, 43, 46, 49-50,
87, 114, 160, 236, 242, 258

Maharashtra Board, 65-66, T70, 75-76,
85-86, 89, 168

Margin of error. See Standard error of
measurement (SEM)

Mark, correct: average of examiners, 10,
15; correlated with standard error of
marking, 144; defined, 115-116, 120;
distinguished from true mark, 115-116,
141; and grace marks, 276-279, 282-283;
in history, 140-141, T142; and original
history examiners, 150; standard error of,
140

Mark, meaning of: different for biology
and mathematics, 41, 87; different for
mathematics from English and history,
37, 243; different for mathematics from
Hindi, 41; different for different
candidates, 152; examiners, 243. See also
Scales used; Mathematics, scale different
from other subjects



INDEX 


Mark, true: defined, 42, 115-116, 170;
distinguished from correct mark, 115-116,
141


butioos of, Specified distributions of 
marks 


141 

Marking conviclion of “• 

scales used), variations 67-70, 

SmSfm y™ss,bl e l,n..nof.270- 

271 


Marking method* d «'« W?" SS 

mTSZS** «**£££• “&£f: 

62, impressionistic v- s^nab g ? (j „ 

«• ‘SSSSfflwSwinaitoM »m- 

ah0 Traditional type encourage 

ESSrtfcrt?*J<6. maximum, marks 
determined by V*?! 

S U 260-267. 

gj-wassa*****. 

two-stage. 245 9 


marks 

Marks mean of alt four subjects com- 
W 3 % 158 160 7770 m biology. 
gfSESu ™J. n« 218-219, corrrlajion 
or with standard deviations, 174, 
of disagreeing e«m,«rs irwre^cmaic 
than of agreeme 86-87, 
etTcct of d (fcrcnces m, on sianaaro 
demtion of error, 171. of examiners 
compared with mean of correct marks, 
hTtShI. Till, frcquefcy^f" 
bu'iMior.mJ.u! 'g’ 

I as i uu — studies, 67-70 

w^rnM 

255 variation in lot ■ «' 

F3J 34-35, 171 175 7774 

tJJ, » , answei 


tWO-Siag*. *■ . n 

Marks accuracy of 
average jeUted to ra ^ orde rs), 

with ranks (*«• Maria of questions, 
determined by differences 

190-192. ^‘A.^rences »n marks 
between, (see al ability awar- 

between exa t m J" c . rS LSat ability awarded 
ded different (see L 2,i cr to inferior 
*■*«* 259?* mSnum for only 

?, n rfKwS8^vi5”': 

144, of onsmal i'SStW 1 ”*, 

Till. r*« 3 5; d i40»,2S2. rehabi « 

of, 77 


pit 34-35.171 175 

Marks ra j ng ^°^ 0 ? V 6^29, n miumum. 
^no: minimum l» , Q> 

ability 132. relate M „ un - 

saas 


scaled. 257 258 ow 

Marks and rank orders ^SSgrcc- 

tween. 13 15 anddivtnbution 

-r;l«.»g, no. 2-200 

c.. Reliability, marks 


S£2S3^"— 

tionsfor,248 6 . 7> 19.50. 


art fS^^aar. 


SS « » g .fifs *«>• 
*“"S?«ffS« r •“US"!’.! in 


37. 240-241. f.VO, 

variation , tended in. 
largest differences between examiners in,

6, T21, T32, 35, 173, T174, 232-233;
marks in, differences between, 6, 19-20,
T21, 158-160, 229, T232, 232-234; marks
in, distributions of, F34, 35-42, F36,
T38-39, 162, T224, 224-225, 234; meaning
of marks in, compared to other subjects,
37, 243; means of, T39, 158-160, T170,
226-227, 229, T232, 234, 236-238;
misclassification of students in, 20-22, 71,
231; "objectivity" of, 69, 223, 227;
paper, description of, 224, 239; Pass
marks too low in, 27-28; Pass Percentages
in, 6, 22-24, T23, 228, T232;
reliability of, compared with history, 6;
reliability, content, of, T70, 72-73,
166-167, T170, 229-230, T232, 234, T238;
reliability, examiner, of, T70, 164-165,
T170, 227, 229, T232, 234-237, T236,
T238, 285, T285; reliability of, less than
of English, 143; reliability, marks, of,

6, 42, F44, 46-48, T70, 168, T170, 230-231,
T232, 234, T288; scale different from
other subjects, 7, F34, F36, 37, 41, 43,
160, 171, 173, 225, 237, 243, 258; scale
used in, 26-28, 31, 37-40, 225, 236-237,
258; scaling, need for, 231, 238-239;

"second" examiner in, 226; self-consistency
of examiners in, 237-239; self vs.
other disagreement of examiners in,
19, 159-160, 163-164, 171, 235-237;
six equal batches in, compared, T32,
32-35, F33, 171-175, T174, 224, T224,
230, 235-236, 238, T238; standard deviation
of, T39, 41, 160-162, T170, 226-229,
T232, 234, 236-238; standard error of
measurement of, 42-48, F45, F46,
170-171, T170, 231, T232, 234, T288;
standard of marking in, 41, 161, 227-228,
236, 239; styles of marking in, 164-165,
230, 236; summary of statistics on, 4,
6-7, T170, 225-234, T232; weighted
double/triple in aggregate, 162, 258

Maximum error. See Errors in marks,
maximum possible

McNemar, Q., 107


Mean errors, scaled and unscaled, 119-121;
of marks (see Marks, mean); and
standard deviation transformations,
scaling by, 114, 249-254


Meaning of marks different in different
subjects. See Mark, meaning of

Measurement error: correcting grace marks
for, 276-280; routine calculation
recommended for public examinations, 270.
See also Errors in marking; Standard
error of measurement (SEM)


of correlation coefficients, 162,
T170; of history marks, 140, T142, 144


Medicine, post-graduate examinations in, 
59 n 

Memory not a factor, 164
Merit, order of. See Rank orders; Rank
order scaling

Misclassification of students, 6, 10, 20-22,
70-71, 130, 184, 200, 214, 231
Misra, V. S.: chapter by, 53-95; factor
analysis by, 66, 73; on model answers,
86; on questions, 82-83; on reliability,
T70, 73, 76, 115, 122, 124, 242; his
reliabilities compared with ours, 285;
research with Taylor and others, 68,
74, 78, 86-87, 128-129; on standard
error of measurement, 80; on validity of
Teacher Training courses, 79
Model answers: differences in standard not
controlled by, 189, 219; recommended,
86; not supplied, 8, 19, 111
Multiple choice questions, recommended,
92

Multiple examinations, candidates faced
with, 123, 138, 183, 199, 213, 230
Multiple marking, 74, 93, 260-267; effect
of, 262-264, T263, T264; marking error
reduced by, 5, 18, T264; number of
examiners needed for, 262-264, T263,
T264; question-wise marking compared
with, 265-267; reliability, effect on, 60-64,
86, 260, 266; by two vs. three examiners,
86-87, 93 (see also T263, T264); validity,
effect on, 63-64


NCERT, 85-86, 92-93
Ninety Marking Ten: brief summary of,
4-6, 68; compared with Taylor et al.,
68, 108, 110, 113, 116, 128-129, 135-136,
140-142, 146 n, 150-151, 154; detailed
report of, 128-154; experimental design
of, 107-112; highlights of, 8-18;
selection of examiners for, 8, 108-109
Non-linearity. See Item analysis, negative
validity

Normalized scaling, 131, 250, 253-259 


Objective examinations: main advantage of,
167; recommended, 85-86, 92, 268
Objective vs. traditional examinations:
combined use recommended, 63;
comparative reliabilities of, 6, 25, 56-58,
166-168, 183, 199, 213-214, 230-231;
comparative validities of, 61, 63-65, 79;
conversion of traditional to objective,
58-59, 64; correlation with cumulative
grade point average, 64; reliability
concepts in, 47, 115-117; reliability of
marking, 16; research on, 91; sampling, 167;
standard errors of measurement of, 43,
F46; validity of, in predicting essay
marks, 63-65, 79; what is measured by?,
58-59, 62, 64, 60, 83

Open-book examinations, recommended, 
Optional questions See Choice of optional 

140, TJ40, T142, J 54 

orders. Rank order scaling 
Original examiner, in history, 8, 149-151,
T151

Other-agreement. See Examiner agreement,
self vs. other
Oxford University, 61 
Pages, number of. vs marks. 9.49, !«. 

tMfvrs mter-correlation of, 77 , u 
p Of. 82 83 . selection of. 96-9J. 
108-109 „ , 

Pass-fail l ' nc J "jsu^ris find chance, 

119-140 ; disagreement on. M>. ". -> • 
l^faiVes due to 

matriculation vs 

Failure rate lowest m 

IMS 


« all four 

12-13, 135-136, 1 • measures of 

“EKShu f?a.229 1 raisrdby 


12-13, i 7-7 j ? • as measures ui 

mathematics, --8. 723.^ raK ed by 

teaching quality. 12* ^, ingt 87 , jT4, 
randomia" 0 " ^ ,his study, 99 , 

3ESSBjfos3S!S 

ffiT’eV’Sk""” i". of ' 

12-13. 245 


Passing probability, calculation of, 278-
279, T282, 283

Percentage in each Division, routine
calculation recommended, 270
Per cent agreement. See Division, per cent
agreement on

Percentile ranks: changed because marks
unscaled, 161-162; scale, 87, 254
Persistence effect, 83, 93, 110, 154
Personal equation, 239, 243; determining,
249

Physics examinations, 67, T70, 71-72, 77,
85, 93

Pooling of data: distributions widened
because of, 130; effect of, on , ,

F169, 169-170; justification for, 24
"Popularity" of questions, 192, 221
Position effects, 154. See also Order of
marking, first and second compared
Probabilities, multiplication of, 280, 283-
284

Psychometrics 'psychometry. nature cl. M 
Publications for examination reform. 9. 
Public examinat.o^ - m.sctodto.ion .m 

Publicity, need for, 95, 270
Purpose bt this research. 1-J. 8.96 

Questionnaire, administered to examiner,. 

Question parers See Paper, description 

Quest tons, vanannn 1 ^;, ^ of , 0 „ 
biology. . effeetof. 

marks. K’tllJ in Hindi. ' 

on reliability. 8* •!, j t gj in mathe- 
•n history. 

manes. I”.- 40 '™ JM.V67.77M: 

Quest, on-»ise '"['“‘m 266-767 : vs 
cheaper and •‘,66.266 ; Radha 

multiple parking 
Krishna s suggestion. . 

Qutub Minar effect, 40


_ ... r of Kanpur Unir ) 

Ra <5?iSn-wi« j", Um . 
Rahman, H., 40
Ramanujan, 47 n

" 8 ; to.' 248-250 

raises Pass Percentage, 134 

TH6&S32&& 

SSSS 5 ! 

of *te awarded angle answer 
tTld 714 14 , VS standard deviation, 
m n or' iotal error. 126 , unrelated to 
miS 136-138, FW See o bo Marks, 

Tange of , , . 

R s,^s,rr^rSr3 

140, T140; compared with marks, 14-15,
140, T142, 255; correlations of, T147;
examiners disagree on, 13-14, 68; mean
of, 140, T142, 143-144; range of, T14,
14; standard deviation of, 140, T142,
143

Rank order “a'™’ . ‘ 5 ' JvV^dfecl 
advantages of, 16, 144, 255-259; effect
on marks distribution of, T257
Rao, C. R., 106
Raw marks, 245-246, 249-252
Reader reliability. See Reliability, examiner
Recommendations, 81-95, 242-271; determining
marks reliability, 124-125; elimination
of optional questions, 83-84,
241, 269; if examiners disagree, re-mark
entire batch, 262; grace marks,
Taylor's substitute for, 134, 270, 276-284;
increase number of questions, 83,
269; instructions to examiners, ao, o » ';
miscellaneous, 268; on multiple marking,
260-267; multiple marking, when
to appoint third examiner, 262; outlaw
J-effect, 40-41; question-wise multiple
marking, 260, 264-267; routine statistics
to be calculated for every public
examination, 270; scaling, 69, 87-88,
134, 242-259; self-experimentation, 3,
269 , specifying 

205, 220, 238-239, 244-246, 250 .substi 
tute five or six categories f or Jf°2” a J? 
scale, 7 ; training of examiners, 61, w. 
123,269 

Re-examination of answer books, disadvantages
of, 32; failures would pass

Reform: consciousness for, 91-92; single
easiest, 242, 244, 259
Relative merit. See Rank orders

STST-t history «am,na.,ons, h 6, 
Indi'an' studies' of. 77 90 

matics examinations, 6, 4 ♦ kc< j by 

raised if each Q u “ 2 66 See also, 
difTerent examiner, 260. effect on 

Choice of °rt,rSSntS Rdl ' 
reliability , Reliability, con ^ 
ability, examiner , Reliaoimy. 

"ttS-^SSW! 

T126 ; of k cocffiaent 0 f 

T75, 76, 7 ®V 9 117 169-1701 concept 
inadequacy of. H '• * am tnations, 113- 

apphed to traditional e«m (ners> factors 

127 , defined, 55 , oi *27. function, 
involved in, 180, . dcx of 1500, 

of test length, 5 1 . exami nalions 
meaning of, m object^ 4?> 76 _ 77> 165. 

168 ; minimum **g u SSity of, 80-81; 
270-271 ; reader! defined, 

raised by s ?*] ,n Reliability. examiner) j 

Sipkrahle ln mhabiMy^ ^ 

7, 8, 47-48. 149, 167, 182, ^ 

for. 122’ r/26, 273*27 5 , 99 T2QQ f 

and. 123 . /u Hi* ^83, 7184 1°! 
history, 73, 14» ‘ » ^^matics, 770. 
link test, 249 , „ 234 for mathc- 

compared ^h Gayen and Mism, 

'jifS'S S Harper's reliability 
formula . 

R «S'Sfa?jS^^Par;?; 

1 fit-165 7770, for biology, 77U , r ~A 
2I3 T2M T2M. 216-218 .12/8. d *' d ' 
46 114-115, differs from oht' c " v ' 

^mfitLrel,abd,.y.ll4,166^« c ^ 
that lower, 86, highest found i fynn 
studies, 62. for Hindi, 196. 198, T2W, 
7702, 202-203, T204, for history, 16-17, 
26, 145-148, T147, 180-182, T184, T185,
186-188, T188; for history compared
with Misra's study, 285; increased by
special training, 7, 166 (see also Traditional
type examinations); inter-examiner
consistency (D) defined, 114-117,
122, T126; and marking criteria used,
62; for mathematics, T70, 227, 229,
T232, 234-237, T236, T238; for mathematics
compared with Misra's study,
285; methods for increasing (see Traditional
type examinations, improvement
of); misleading if marks unscaled,
117-118; reduction in scripts won't
improve, 69; routine calculation of,
recommended for public examinations,
270; scaling more important than, 259;
self-consistency (S) defined, 115-117,
122, T126; self vs. other, 61; table of,
T285

Reliability, marks, T70, 75-77; of all four
subjects compared, 42, F44, 46-48,
167-168, T170; of biology, T70, 213-214,
T214; definition and calculation of, 47,
115-116, 124, T126, 168, 183, 199, 213,
230; of Hindi, 199, T200; of history,
16-17, 144, 149, 183, T184; lowered for
unscaled marks, 118; lowered by vague
questions, 82; of mathematics, T70,
230-231, T232; in objective examinations,
163; routine calculation of, recommended
for public examinations, 270; table
of, T288; of traditional examinations
comparable to objective examination
reliability, 47, 57, 115, 166, 168
Reliability of marking. See Reliability,
examiner

Reliability, total. See Reliability, marks
Reproduction, methods of, 108, 110
Research design. See Experimental design
Research on examinations: earliest, 54;
efficiency aspects, 66-81; foreign, 53-61;
functional aspects, 65-66; handicaps to,
94; improvement, 81-90; Indian, 65-91;
and this book, interpretation of, 95;
needed, 94-95; summary, 90-91
Reverse-J: in biology, 209; in Hindi, 195;

in history, 131, 179; in mathematics,
225. See also J-effect
Review of the literature, 53-95
Rotaprint, 10? 

"Same" vs. "Different" re-examiner. See
Examiner agreement, self vs. other
Sample: adequately representative, 99-
100, 109, 136; of five hundred adequate
for routine examination statistics, 270;
study of stratified random, 116 n,
146, 146 n, T147, 150 n, 151-154. See
also Experimental design


Samples, matching of, 102-103
Sampling in objective vs. traditional
examinations, 167
Sanskrit examinations, T70, 72, 75


converting non linear to linear 
255. See also Rank order scaling

Scales used, 7, 41; adding different, 7,
35, 259; for biology, 209; effect on Pass
Percentage, »/, 245; effect on standard
error of measurement, 171; examiners
differ in, 136-139; for Hindi, 195; for
history, 16-17, T138, 143, 179; for
mathematics, 26-28, 31, 37-40, 225, 236-
237, 258; for mathematics and for biology
compared, 41, 237; for mathematics
different from other subjects, 7, F34,
F36, 37, 41, 43, 160, 171, 173, 225, 237,
243, 258; for mathematics and for
English compared, 37, 243; reason for
wide variety in, 37, 244; for science
marks higher than for arts, 158
Scaling, 2, 242-259; defined, 114, 243;
of different subjects to common standard,
258-259; effect on Pass Percentages,
13, 87, 134; effect of, example,
251-252, T257, 257-258; effect of not,
17, 143, 161, 253-259; examiners, 242,
244; example of, 251-252, T257; formula
(see Scaling methods); importance of,
more than of examiner reliability, 259;
of individual questions, 87, 221, 241;
of mathematics, 27-28, 37-40; randomization
and, 13, 93, 134, 238, 248-250;
recommended, 69, 87-88, 134, 242-259;
reduces standard error of measurement,
46, 53; and reliability, 81, 117, 146-147,
234, 242; simplicity of, 242; single easiest
change, 242, 244, 259; and standard
errors, 118-121; subjects, 242, 244, 253-
259

Scaling need for, 4’} *0 69 87 88. 25*. 
238; in biology, 214, 219-220; in Hindi,
200, 205; in history, 184, 189; in mathematics,
231, 238-239. See also Scales used
Scaling method!. 93. 244-259, by aiiifft 
or subtracting a constant. 25- - J. 
common bat-h. 249, 23 1 : «*T*resl wuh 
temperature scale conversion 
49-50, 114. |73*1 74 24J. described. 250. 

a ur-percermte. 2*0. formula 7 •- F’ect 
. \ 39. •grading on f* cur* , tu\ 
242. 244. rndf-mema! 2JJ.r* 

2 U 246. hekrext.249 2! 

method, of. teed JS 

mean c4 nark*. -51 — 

standard dev unco If 4. 

1/1, 249-254. nxwunl. »*. 
253-259; randomization and scaling, 13,
93, 134, 238, 248-250; random sample
adequate for, 249; rank order scaling,
15-16, 87, 144, 254-259, T257; raw vs.
scaled marks, 245-246, 249-252;
re-examining small samples, 247-248, 251;
role of Head Examiner, 246-248; simplicity
of, 242; specifying distributions of
marks, 189, 205, 219, 238-239, 244-246,
250; standard deviation scaling, 114,
131, 249-254; standard scores, 250;
use of selected samples in, 247, 249;
without randomizing scripts, 248-249
Scatter charts of Divisions. See Divisions,
scatter charts of

Scholastic aptitude tests, recommended, 91
Science examinations, 65, 69-70, T70, 75, 78,
85, 93, 158. See also Biology; Chemistry;
Mathematics; Medicine, post-graduate;
Physics

Score reliability. See Reliability, marks
Score, true, 170. See also Mark, true
SD. See Standard deviation(s) (SD)
Second Divisioners awarded Distinction
and First to Fail, T9, T26-27, F29-30, 43
"Second" examiner, consistent features of:
in biology, 210; in Hindi, 196; in history,
180; in mathematics, 226
Self-agreement. See Examiner agreement,
self vs. other; Examiners, self-consistency
of

Self-consistency. See Examiners, self-
consistency of

Self-correlation. See Reliability, examiner
Self-experimentation recommended, 3, 269
SEM. See Standard error of measurement
(SEM)

Semester system recommended, 90
Short answer examinations recommended,
92; reliability and validity of, 269
Significance, statistical vs. educational, 159,
164 n, 177

Six equal batches, marking of. See
Examiner X
Skewness, 144

Snedecor, G. W., 136, 138, 146
Social studies examinations, T70, IS 
Spearman-Brown Prophecy formula, 57-58
Specified distributions of marks, 244-246,
250; method, 244-246; need to specify,
189, 205, 219, 238-239
Srivastava, A. B. L., 107
Standard deviation(s) (SD): of all four
subjects compared, T39, 41, 160-162,
T170; of biology, 210-213, T214, 218-
219; correlation with means, 174;
determines weight of paper, 160-162,
253-258; estimated from range, 136,
T138; effect of, on standard error of
measurement, 171; of Examiner X,
F33, 34-35, 172-174, T174; of Hindi,
195, 198, T200, 204-205; of history, 135-
141, T138, T142, 143, 179-182, T184,
188-189; of history correct marks, T142;
of mathematics, 226-229, T232, 236-
238; vs. range, 172 n; routine calculation
of, recommended for public examinations,
270; used in scaling (see
Standard deviation scaling); variations in,
in six equal sets, F33, 34-35

Standard deviation scaling, 114, 249-254;
advantages of, 253; disadvantages of,
114, 131, 254; examples of, 252; formulas
for, 114, 251

Standard error of combined marks, T126

Standard error of content, defined, 116,
T126

Standard error of estimate, more stable
than r, 170

Standard error of estimation, 116 n. See
also Standard error of marking

Standard error of marking: defined, 116;
example of, 148; formulas for, 125, T126;
in history, 141-142, T142, 144, 148;
scaled (S) and scaled (D) defined, 118-
121, 125, T126; and standard deviation,
141; unscaled (D) defined, 119-121,
125, T126

Standard error of marks, 116-118, 121,
125-126, T126

Standard error of measurement (SEM),
116-118, 121; for all four subjects compared,
42-48, F45, F46, 170-171, T170;
absolute consistency, measure of, 80-81,
116; actual errors larger than, 48, 171;
best indicator of reliability, 80-81, 116,
170; for biology, 214, T214; correcting
grace marks for, 277-280; defined,
42; for Hindi, 200, T200; for history,
16-17, 149, 183-184, T184; interpretation
of, 42, 170; largest for mathematics,
43-46, F45, F46, 171; for mathematics,
231, T232, 234; minimum estimates of,
171; for objective examinations, F46;
reduced by scaling, 46; table of, T288;
theoretical distribution of errors, 282-
283. See also Standard error of content;
Standard error of marking;
Standard error of marks

Standard of marking, 90, 173-174; in
all four subjects compared, 35, 161;
average, 159-160; in biology, 211, 219;
constant characteristic of examiner, 86,
189, 219; and Head Examiner, 246-247;

in Hindi, 197, °?16 239 ?a 

in mathematics, 227-228, 236, 239; in
this study, 111; variations in, 86, 90,
205, 219


Standards, APA-AERA-NCME, 113
Standards for reliability, 47, 76-77, 165
Standard scores, 250
Stanines, 250, 254


Temperature scales as examples of scaling,
7, 27, 49-50, 114, 173-174, 243;
conversion formula, 243

Terms which are used interchangeably, S6 
mathematics, T27, 67 


biamncs, — - ■ 

of 500 adequate for, 270 


mathematics, i u, oi 

Third examiner question-wise martins 
better than 266-261 rKromendaMn 
on when to appoint. 20. W >» r * 
argument on, simplified, 261 262, un 
justified when two examiners differ, 
262 264 


of 500 adequate wi, 

Studies compared, both history, 47-48,
167, 182


262 264 

illustrated, 261 264,7263 


Thorndike. R-L. f® > 16 ’ 118 ' 

reliability standards, 47 n 


164-165 in btoiogy, reliaeimy suu*wu», - - 

Total suability Tee ****»■ ■»“* 
«l» VMM <>' ma-kms Tmd itional type gmjS w S”ob,“ - 

- c»t.-<-tion of. 96-97, wilhobjKtive «.«mpar g 

MHHMH 


alw &uiwh» - 

Subjects studied, 4. selection of. 96-9 . 
103 109 


103 JW 

Summary brief. 4-7, 68 69^ ^ a j,,|,iy 


64’, defined 11 J »VE3«1M 270.’ 

Ig^^fflkoRcconOTendatioM.Sc^ng 

methods), ,n * l ™ t ctl 269 poor reliability 